Every time you type a message and Gmail suggests the next three words, that's a Markov chain.
Every time Google decides which result to show you first, that's a Markov chain.
Every time ChatGPT generates a sentence, the foundational idea underneath all of that transformer machinery is still, at its core, a Markov chain.
And the whole thing exists because two Russian mathematicians got into a fight about free will in 1906.
I'm not exaggerating. The algorithm that made modern AI possible was invented out of spite.
the fight
In 1905, Russia was on fire. Socialist groups rose up against the Tsar. The country split into two camps. Tsarists on one side, socialists on the other. And this division was so deep that it crept into everything. Including mathematics.
On the Tsar's side was Pavel Nekrasov. Unofficially called the "Tsar of Probability." Nekrasov was deeply religious, politically powerful, and he had this idea that probability theory could be used to prove free will.
His argument went like this. For 200 years, the foundation of probability had been the law of large numbers. Flip a coin 10 times, you might get 6 heads. Flip it 10,000 times, the ratio approaches 50:50. The average outcome converges to the expected value as you run more trials.
But Jacob Bernoulli, who proved this in 1713, only proved it for independent events. One coin flip doesn't affect the next.
Nekrasov looked at social statistics. Marriage rates in Belgium. Crime rates. Birth rates. Year after year, these numbers converge to stable averages. They follow the law of large numbers. And since the law of large numbers requires independence (or so everyone thought), and these statistics come from human decisions...
Nekrasov concluded: those decisions must be independent. Therefore, humans have free will. QED.
Yeah.
On the other side of this argument was Andrey Markov. Also known as "Andrey the Furious." Atheist. Zero patience for nonsense. He publicly listed Nekrasov's work among the "abuses of mathematics."
Markov didn't just think Nekrasov was wrong. He thought the entire logical chain was wrong. He believed the law of large numbers didn't require independence at all. And he was going to prove it.
a poem and some counting
Markov needed a system where events were clearly dependent on each other. Something where you couldn't pretend the outcomes were independent.
He chose text.
Specifically, he chose Eugene Onegin by Alexander Pushkin. The poem at the heart of Russian literature. He took the first 20,000 letters, stripped out all punctuation and spaces, and pushed them into one long string.
Then he started counting.
43% of the letters were vowels. 57% were consonants.
Next, he broke the string into overlapping pairs and counted the four possible combinations.
| Pair | If independent (predicted) | Actual count |
|---|---|---|
| Vowel → Vowel | 18% | 6% |
| Vowel → Consonant | 25% | 37% |
| Consonant → Vowel | 25% | 37% |
| Consonant → Consonant | 32% | 20% |
The actual values were wildly different from what independence would predict. Vowel-vowel pairs were three times rarer than expected. The letters were clearly dependent.
This makes intuitive sense if you think about it. After a consonant cluster like "str", a vowel almost always comes next. After a vowel, another vowel is unlikely in Russian. The current letter constrains what the next one can be.
So Markov had his dependent system. Now he needed to show it still followed the law of large numbers.
the prediction machine
This is where it gets beautiful.
Markov drew two circles. One for "vowel" (V), one for "consonant" (C). These are states.
From each state, you can transition to either state. So he drew arrows. Four arrows total.
Now he needed the transition probabilities.
He knew 43% of all letters are vowels. He knew vowel-vowel pairs occur 6% of the time. So:
P(V → V) = 0.06 / 0.43 ≈ 0.14
P(V → C) = 1 - 0.14 = 0.86
And for consonants:
P(C → C) = 0.20 / 0.57 ≈ 0.35
P(C → V) = 1 - 0.35 = 0.65
That's it. Four numbers. And with just these four numbers, he built a prediction machine.
Start at a vowel. Generate a random number. If it's below 0.14, stay at vowel. Otherwise, move to consonant. From consonant, if the random number is below 0.35, stay at consonant. Otherwise, move to vowel.
Repeat this thousands of times and track the ratio of vowels to consonants.
At first the ratio jumps around wildly. But after enough steps, it converges. To exactly 43% vowels and 57% consonants. The same split Markov had counted by hand from the poem.
A dependent system. Following the law of large numbers.
And the result doesn't depend on these particular numbers. Make the transition probabilities extreme, make them equal, make them asymmetric: as long as each state can reach the other, the ratio always converges and a steady state always exists. The system is dependent, each step depends on the previous one, and the law of large numbers still holds.
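If you want to run the machine yourself, here's a minimal Python sketch. The two transition probabilities are derived directly from the pair counts above; everything else is just the loop Markov described:

```python
import random

# Transition probabilities derived from the pair counts:
# P(V -> V) = P(VV pair) / P(V),  P(C -> V) = P(CV pair) / P(C)
p_stay_vowel = 0.06 / 0.43   # ~0.14: vowel followed by vowel
p_to_vowel   = 0.37 / 0.57   # ~0.65: consonant followed by vowel

rng = random.Random(42)
state, vowel_steps, steps = "V", 0, 200_000
for _ in range(steps):
    if state == "V":
        state = "V" if rng.random() < p_stay_vowel else "C"
    else:
        state = "V" if rng.random() < p_to_vowel else "C"
    vowel_steps += (state == "V")

print(f"vowel fraction: {vowel_steps / steps:.3f}")  # settles near 0.43
```

Two hundred thousand steps is overkill; the ratio is already stable after a few thousand.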
Markov had built a chain of dependent events. A literal chain. And he proved mathematically that convergence happens anyway.
He ended his paper with one final sentence aimed straight at Nekrasov:
"Thus, free will is not necessary to do probability."
Andrey the Furious, indeed.
what makes a Markov chain a Markov chain
Let me be precise about what Markov actually formalized, because this is the part that makes everything else work.
A Markov chain has one defining property. The memoryless property, also called the Markov property:
P(next state | current state, all previous states) = P(next state | current state)
The future depends on the present, not the past. Only where you are right now matters. Not how you got there.
Think of it like this. You're standing at a crossroads. There are three paths forward. The probability that you take each path depends on which crossroads you're at. It does not depend on which crossroads you came from, or the one before that, or the one before that.
One state. One set of transition probabilities. No memory of history.
This sounds like a limitation. It's actually a superpower. Because it means you can take enormously complex systems, systems with millions of possible states and long complicated histories, and ignore almost all of that complexity. Just look at where the system is right now. Predict what happens next.
That simplification is what makes Markov chains computable. And computability is what makes them useful.
the bomb
Markov himself didn't care about applications. He wrote: "I am concerned only with questions of pure analysis. I refer to the question of the applicability with indifference."
But applicability came anyway. And it came with a bang. Literally.
In January 1946, a mathematician named Stanislaw Ulam was recovering from encephalitis. Nearly died. Spent months in bed. And to pass the time, he played solitaire.
After hundreds of games, a question nagged at him. What are the chances that a randomly shuffled game of solitaire can actually be won?
There are 52! possible arrangements of a deck. That's about 8 × 10⁶⁷ possible games. You can't check them all. You can't solve this analytically. But...
What if you just play hundreds of random games and count the wins?
This flash of insight followed Ulam back to Los Alamos, where he was working on nuclear weapons. The problem there was understanding how neutrons behave inside a uranium core.
A neutron hits a U-235 nucleus. The nucleus splits, releasing energy and 2-3 more neutrons. If those new neutrons hit more nuclei, you get a chain reaction. But the physics is complicated. Neutrons scatter, get absorbed, escape the core, or cause fission, all with probabilities that depend on position, energy, and velocity.
Trillions of neutrons. Trillions of possible outcomes. Analytically unsolvable.
Ulam shared his solitaire insight with John von Neumann: what if we just simulate it? Generate random outcomes, over and over, and use the statistics to approximate the answer?
Von Neumann immediately saw the power. But he also spotted the problem. Solitaire games are independent. Neutrons aren't. A neutron's behavior depends on where it is and what it's done before.
You need a Markov chain.
So they built one. A neutron starts traveling through the core. At each step:
- It can scatter off an atom and keep traveling (arrow back to itself)
- It can escape or get absorbed (chain ends)
- It can hit U-235 and cause fission, releasing 2-3 new neutrons that start their own chains
The transition probabilities depend on the neutron's current state: its position, velocity, energy. Memoryless. Only the present matters.
Run this chain hundreds of times. Track how many neutrons are produced per initial neutron. That gives you k, the multiplication factor.
If k < 1: reaction dies. If k = 1: sustained but stable. If k > 1: exponential growth. Bomb.
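A toy version of that simulation fits in a few lines. The probabilities below are illustrative placeholders, not real U-235 cross-sections (in the real calculation they depend on the neutron's position, energy, and velocity):

```python
import random

# Illustrative, made-up probabilities for one neutron step
P_SCATTER, P_LOST, P_FISSION = 0.50, 0.30, 0.20  # lost = escaped or absorbed

def neutrons_from_one(rng):
    """Follow a single neutron's chain until it ends."""
    while True:
        r = rng.random()
        if r < P_SCATTER:
            continue                  # scatter: transition back to itself
        if r < P_SCATTER + P_LOST:
            return 0                  # escaped or absorbed: chain ends
        return rng.choice([2, 3])     # fission: 2-3 new neutrons

rng = random.Random(0)
trials = 100_000
k = sum(neutrons_from_one(rng) for _ in range(trials)) / trials
print(f"k ~ {k:.2f}")
```

With these made-up numbers, a neutron eventually causes fission with probability 0.2 / (0.3 + 0.2) = 0.4 and yields 2.5 neutrons on average, so k ≈ 1.0: right at criticality.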
As the uranium mass increases, the distribution of k shifts from subcritical to supercritical. That transition point is the critical mass. The minimum amount of fissile material needed for a bomb.
Ulam named the method after his gambling uncle's favorite casino. Monte Carlo.
The Monte Carlo method. Born from solitaire, powered by Markov chains, used to build nuclear weapons. And today it's used in everything from financial modeling to protein folding to climate simulation.
Ulam later wrote: "It is still an unending source of surprise for me to see how a few scribbles on a blackboard could change the course of human affairs."
the search engine
Fast forward to the 1990s. The internet explodes. Thousands of pages appear daily. And nobody can find anything.
Early search engines like Yahoo and Lycos ranked pages by counting how many times your search term appeared on the page. Simple keyword matching.
The problem? It was trivially easy to game. Hide keywords in white text on a white background. Repeat them a thousand times. Boom, top ranking.
There was no notion of quality. Only relevance.
Two Stanford PhD students, Sergey Brin and Larry Page, realized that links between pages contained information about quality. If a page has many other pages linking to it, it's probably important. And if those linking pages are themselves important, the endorsement means even more.
They modeled the entire web as a Markov chain.
Each webpage is a state. Each link is a transition. If Amy's page links only to Ben, there's a 100% chance of transitioning from Amy to Ben. If Ben links to three pages, each gets a 33% chance.
Now imagine a "random surfer" starting on a random page. They follow links randomly. Over thousands of steps, the fraction of time spent on each page converges to a stable distribution. Pages where the surfer spends more time are ranked higher.
But there's a problem. Some parts of the web aren't connected. The surfer can get stuck in a loop, bouncing between a small cluster of pages forever, never reaching the rest.
So they added a damping factor. 85% of the time, follow a link. 15% of the time, jump to a completely random page. This ensures the chain is ergodic. Every state is reachable from every other state. Convergence is guaranteed.
There are two ways to compute the ranking. Power iteration computes the exact stationary distribution; simulating the random surfer approximates it. They converge to the same ranking.
And in a small example graph, Ben consistently ranks highest even though every node starts with equal score. That's because Ben has the most incoming links from pages that themselves have importance. Quality links matter more than quantity.
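Here's a sketch of power iteration on a tiny graph. Only the Amy → Ben link comes from the text above; the other pages (Cal and Dee) and their links are invented for illustration:

```python
# Hypothetical four-page web; every page has at least one outlink
links = {
    "Amy": ["Ben"],
    "Ben": ["Amy"],
    "Cal": ["Amy", "Ben"],
    "Dee": ["Ben"],
}
d = 0.85                                   # damping factor
N = len(links)

rank = {page: 1 / N for page in links}     # every node starts equal
for _ in range(100):                       # power iteration
    new = {page: (1 - d) / N for page in links}      # random-jump share
    for page, outlinks in links.items():
        share = d * rank[page] / len(outlinks)
        for target in outlinks:
            new[target] += share           # follow-a-link share
    rank = new

print({page: round(score, 3) for page, score in rank.items()})
```

Ben comes out on top (roughly 0.47): three of the four pages link to him, including the relatively important Amy, while Cal and Dee, with no incoming links at all, get only the random-jump floor.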
They called the algorithm PageRank. ("Page" for web pages, and also because Larry's last name was Page. He snuck that in.)
Initially they called the search engine BackRub, after the backlinks it analyzed. Then they decided they needed a name as big as their ambition. They thought of the largest number they knew. 10¹⁰⁰. A googol.
When registering the domain, they misspelled it.
Google was born.
Today, Alphabet (Google's parent company) is worth over $2 trillion. And at the heart of it is a Markov chain that only looks at the current state to predict what happens next.
from letters to language models
Remember how Markov started all this? Predicting vowels and consonants in a poem.
In the 1940s, Claude Shannon, the father of information theory, took this idea further. Instead of vowels and consonants, what if you used individual letters as states? And instead of looking at just the last letter, what if you looked at the last two? Or three?
With a first-order model (predict based on the previous letter), you get gibberish. With a second-order model, you start seeing recognizable fragments. "th", "the", "ing". With higher orders, the generated text starts looking almost sensible.
Then Shannon asked: what if instead of letters, you use words as states?
The generated text got much better. Sequences of three or four words often made sense. "Attack on an English writer" is grammatically valid. The predictions improved because each step carried more context.
Shannon had discovered something fundamental. The more context you include, the better your predictions become.
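A first-order word-level chain takes only a few lines to build. The training text here is an arbitrary toy corpus, not Shannon's actual data:

```python
import random
from collections import defaultdict

corpus = ("the head of the house saw the attack on an english writer "
          "and the writer saw the house and the attack")
words = corpus.split()

# Transition table: word -> list of observed successors
# (duplicates in the list act as weights when sampling)
chain = defaultdict(list)
for prev, nxt in zip(words, words[1:]):
    chain[prev].append(nxt)

rng = random.Random(3)
word = "the"
generated = [word]
for _ in range(12):
    word = rng.choice(chain[word] or words)   # fall back if no successor
    generated.append(word)
print(" ".join(generated))
```

The output is locally plausible (every two-word window occurred in the corpus) but globally aimless: exactly the behavior Shannon observed.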
This is exactly what modern language models do. At the most abstract level, a large language model is answering the same question Markov asked in 1906:
Given what came before, what comes next?
The simplest version of this is a first-order word-level Markov chain: each word is predicted based only on the previous word. It produces text that's sometimes grammatical but rarely meaningful.
Modern LLMs do the same thing, but at a completely different scale. Instead of looking at one previous word, they look at thousands of previous tokens (words, parts of words, punctuation). Instead of storing transition probabilities in a lookup table, they learn them as billions of neural network parameters.
And the key innovation, attention, is essentially a mechanism for deciding which of those thousands of previous tokens matter most for the current prediction. In the phrase "the structure of the cell", attention lets the model look back at "blood" and "mitochondria" from earlier in the context to decide that "cell" means biology, not prison.
But strip away the transformers, the attention heads, the billions of parameters, the RLHF, the prompt engineering. The fundamental question is still:
P(next token | all previous tokens)
It's Markov's question. Extended. Scaled up. Made deep. But the same question.
the memoryless property is the point
Here's what still gets me about all of this.
Weather systems have histories stretching back millions of years. Text has context going back entire books. The web has billions of interconnected pages with link structures built over decades. Nuclear reactions involve trillions of particles with complex interaction histories.
And for many of these systems, you can ignore almost all of that history. You can just look at the current state and make a meaningful prediction about what happens next.
That's not a hack. That's not an approximation born from laziness. It's a deep structural property of many real-world systems.
The memoryless property is what makes these systems computable. Without it, you'd need to track the entire history of every state to make a prediction. With it, you just need to know where you are right now.
As one paper put it: "Problem-solving is often a matter of cooking up an appropriate Markov chain."
And it all started because one furious Russian mathematician wanted to prove another one wrong.
the math for the nerds who stayed
Here's the formal machinery.
Definition. A (discrete-time, discrete-state) Markov chain is a sequence of random variables X₀, X₁, X₂, ... satisfying:
P(Xₙ₊₁ = j | Xₙ = i, Xₙ₋₁ = iₙ₋₁, ..., X₀ = i₀) = P(Xₙ₊₁ = j | Xₙ = i) = Pᵢⱼ
The transition matrix P has entries Pᵢⱼ where each row sums to 1 (it's a stochastic matrix).
Stationary distribution. A probability vector π is stationary if:
π = πP
Meaning: if the system is in distribution π, one step later it's still in distribution π. The system has "settled."
For the Markov chain we built from Eugene Onegin:
P = | 0.14 0.86 |
    | 0.65 0.35 |
(rows are the current state, V then C; columns are the next state, V then C)
Solving πP = π with π₁ + π₂ = 1:
π₁ = 0.65 / (0.86 + 0.65) ≈ 0.43
Exactly the vowel frequency Markov counted by hand. (With exact fractions it comes out to precisely 0.43: π₁ = (0.37/0.57) / (0.37/0.43 + 0.37/0.57) = 0.43.) The stationary distribution of the transition matrix reproduces the marginal letter frequencies of the poem, which is what convergence means here: run the chain long enough and the time-average fraction of vowels approaches 43%, no matter where you start.
The key theorem is:
Ergodic theorem for Markov chains. If a Markov chain is irreducible (every state reachable from every other) and aperiodic (doesn't cycle with a fixed period), then:
- A unique stationary distribution π exists
- For any starting state, the distribution converges to π
- The time-average fraction spent in state j converges to πⱼ
This is the generalized law of large numbers for dependent events. Markov's core result.
PageRank formulation. For a web graph with N pages, the transition matrix is:
M = d × A + (1 - d) × (1/N) × J
Where:
- A is the link-based transition matrix (Aᵢⱼ = 1/outlinks(i) if page i links to page j)
- d = 0.85 is the damping factor
- J is the N×N all-ones matrix
- The (1-d) term is the "random jump" that ensures ergodicity
The PageRank vector is the stationary distribution of M:
π = πM
Monte Carlo estimation. For any function f of the state, the Monte Carlo estimate after T steps is:
f̂ = (1/T) Σₜ f(Xₜ)
By the ergodic theorem, f̂ → E_π[f] as T → ∞. The estimation error decreases as O(1/√T), which means doubling accuracy requires 4x the samples. This is dimension-independent, which is why Monte Carlo works well in high-dimensional spaces where grid-based methods fail exponentially.
Connection to LLMs. A language model defines:
P(xₜ | x₁, x₂, ..., xₜ₋₁)
This is not Markov in the strict sense because it conditions on the entire history, not just the current state. But the autoregressive generation process (sample one token, append it, sample the next) is a Markov process in the space of sequences. The "state" is the entire sequence so far. Each generation step depends only on the current sequence, not on which random samples led to it. The Markov property holds at the level of the generation trajectory.
The chain is: generate token → update context → generate token → update context → ...
Each step is memoryless with respect to the process of generation. The model doesn't remember which candidates it considered and rejected. Only the sequence itself persists.
So even modern LLMs, in a deep structural sense, are running a Markov chain. Just one with an astronomically large state space.
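The trajectory view is easy to make concrete. Here `next_token_probs` is a toy stand-in for a real model; the point is only the shape of the loop, where the next distribution depends on the current sequence alone:

```python
import random

def next_token_probs(state):
    # Toy stand-in for a language model: any function of the current
    # sequence works, because the state carries all the context.
    return {"a": 0.8, "b": 0.2} if state[-1] == "b" else {"a": 0.2, "b": 0.8}

rng = random.Random(0)
state = ("a",)                     # the state is the whole sequence so far
for _ in range(8):
    probs = next_token_probs(state)              # depends only on the state
    token = rng.choices(list(probs), weights=list(probs.values()))[0]
    state = state + (token,)                     # transition in sequence space
print("".join(state))
```

Nothing about which candidates were rejected along the way survives to the next step. Only the sequence does.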
Markov would probably be indifferent about this. He was concerned only with pure analysis, after all.
But I think it's worth saying: a fight about free will in Tsarist Russia led to the mathematical foundation under Google, nuclear physics, Monte Carlo simulation, and every large language model on the planet.
That's a pretty good chain of events.
~ Ashish Kumar Verma