Every time you type a message and Gmail suggests the next three words, that's a Markov chain.
Every time Google decides which result to show you first, that's a Markov chain.
Every time ChatGPT generates a sentence, the foundational idea underneath all of that transformer machinery is still, at its core, a Markov chain.
And the whole thing exists because two Russian mathematicians got into a fight about free will in 1906.
I'm not exaggerating. The algorithm that made modern AI possible was invented out of spite.
the fight
In 1905, Russia was on fire. Socialist groups rose up against the Tsar. The country split into two camps. Tsarists on one side, socialists on the other. And this division was so deep that it crept into everything. Including mathematics.
On the Tsar's side was Pavel Nekrasov. Unofficially called the "Tsar of Probability." Nekrasov was deeply religious, politically powerful, and he had this idea that probability theory could be used to prove free will.
His argument went like this. For 200 years, the foundation of probability had been the law of large numbers. Flip a coin 10 times, you might get 6 heads. Flip it 10,000 times, the ratio approaches 50:50. The average outcome converges to the expected value as you run more trials.
But Jacob Bernoulli, who proved this in 1713, only proved it for independent events. One coin flip doesn't affect the next.
Nekrasov looked at social statistics. Marriage rates in Belgium. Crime rates. Birth rates. Year after year, these numbers converge to stable averages. They follow the law of large numbers. And since the law of large numbers requires independence (or so everyone thought), and these statistics come from human decisions...
Nekrasov concluded: those decisions must be independent. Therefore, humans have free will. QED.
Yeah.
On the other side of this argument was Andrey Markov. Also known as "Andrey the Furious." Atheist. Zero patience for nonsense. He publicly listed Nekrasov's work among the "abuses of mathematics."
Markov didn't just think Nekrasov was wrong. He thought the entire logical chain was wrong. He believed the law of large numbers didn't require independence at all. And he was going to prove it.
a poem and some counting
Markov needed a system where events were clearly dependent on each other. Something where you couldn't pretend the outcomes were independent.
He chose text.
Specifically, he chose Eugene Onegin by Alexander Pushkin. The poem at the heart of Russian literature. He took the first 20,000 letters, stripped out all punctuation and spaces, and pushed them into one long string.
Then he started counting.
43% of the letters were vowels. 57% were consonants.
Next, he broke the string into overlapping pairs and counted the four possible combinations.
| Pair | If independent (predicted) | Actual count |
|---|---|---|
| Vowel → Vowel | 18% | 6% |
| Vowel → Consonant | 25% | 37% |
| Consonant → Vowel | 25% | 37% |
| Consonant → Consonant | 32% | 20% |
The actual values were wildly different from what independence would predict. Vowel-vowel pairs were three times rarer than expected. The letters were clearly dependent.
This makes intuitive sense if you think about it. After a consonant cluster like "str", a vowel almost always comes next. After a vowel, another vowel is unlikely in Russian. The current letter constrains what the next one can be.
So Markov had his dependent system. Now he needed to show it still followed the law of large numbers.
the prediction machine
This is where it gets beautiful.
Markov drew two circles. One for "vowel" (V), one for "consonant" (C). These are states.
From each state, you can transition to either state. So he drew arrows. Four arrows total.
Now he needed the transition probabilities.
He knew 43% of all letters are vowels. He knew vowel-vowel pairs occur 6% of the time. So:
P(V → V) = 0.06 / 0.43 ≈ 0.14
P(V → C) = 1 - 0.14 = 0.86
And for consonants:
P(C → C) = 0.20 / 0.57 ≈ 0.35
P(C → V) = 1 - 0.35 = 0.65
That's it. Four numbers. And with just these four numbers, he built a prediction machine.
Start at a vowel. Generate a random number. If it's below 0.14, stay at vowel. Otherwise, move to consonant. From consonant, if the random number is below 0.35, stay at consonant. Otherwise, move to vowel.
Repeat this thousands of times and track the ratio of vowels to consonants.
At first the ratio jumps around wildly. But after enough steps, it converges. To exactly 43% vowels and 57% consonants. The same split Markov had counted by hand from the poem.
A dependent system. Following the law of large numbers.
And the result doesn't depend on these particular numbers. Make the transition probabilities extreme, make them equal, make them asymmetric: as long as each state can reach the other, the ratio always converges and a steady state always exists. The system is dependent, each step depends on the previous one, and the law of large numbers still holds.
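If you want to run the machine yourself, here's a minimal Python sketch. The two transition probabilities are derived directly from the pair counts above; everything else is just the loop Markov described:

```python
import random

# Transition probabilities derived from the pair counts:
# P(V -> V) = P(VV pair) / P(V),  P(C -> V) = P(CV pair) / P(C)
p_stay_vowel = 0.06 / 0.43   # ~0.14: vowel followed by vowel
p_to_vowel   = 0.37 / 0.57   # ~0.65: consonant followed by vowel

rng = random.Random(42)
state, vowel_steps, steps = "V", 0, 200_000
for _ in range(steps):
    if state == "V":
        state = "V" if rng.random() < p_stay_vowel else "C"
    else:
        state = "V" if rng.random() < p_to_vowel else "C"
    vowel_steps += (state == "V")

print(f"vowel fraction: {vowel_steps / steps:.3f}")  # settles near 0.43
```

Two hundred thousand steps is overkill; the ratio is already stable after a few thousand.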
Markov had built a chain of dependent events. A literal chain. And he proved mathematically that convergence happens anyway.
He ended his paper with one final sentence aimed straight at Nekrasov:
"Thus, free will is not necessary to do probability."
Andrey the Furious, indeed.
what makes a Markov chain a Markov chain
Let me be precise about what Markov actually formalized, because this is the part that makes everything else work.
A Markov chain has one defining property. The memoryless property, also called the Markov property:
P(next state | current state, all previous states) = P(next state | current state)
The future depends on the present, not the past. Only where you are right now matters. Not how you got there.
Think of it like this. You're standing at a crossroads. There are three paths forward. The probability that you take each path depends on which crossroads you're at. It does not depend on which crossroads you came from, or the one before that, or the one before that.
One state. One set of transition probabilities. No memory of history.
This sounds like a limitation. It's actually a superpower. Because it means you can take enormously complex systems, systems with millions of possible states and long complicated histories, and ignore almost all of that complexity. Just look at where the system is right now. Predict what happens next.
That simplification is what makes Markov chains computable. And computability is what makes them useful.
the bomb
Markov himself didn't care about applications. He wrote: "I am concerned only with questions of pure analysis. I refer to the question of the applicability with indifference."
But applicability came anyway. And it came with a bang. Literally.
In January 1946, a mathematician named Stanislaw Ulam was recovering from encephalitis. Nearly died. Spent months in bed. And to pass the time, he played solitaire.
After hundreds of games, a question nagged at him. What are the chances that a randomly shuffled game of solitaire can actually be won?
There are 52! possible arrangements of a deck. That's about 8 × 10⁶⁷ possible games. You can't check them all. You can't solve this analytically. But...
What if you just play hundreds of random games and count the wins?
This flash of insight followed Ulam back to Los Alamos, where he was working on nuclear weapons. The problem there was understanding how neutrons behave inside a uranium core.
A neutron hits a U-235 nucleus. The nucleus splits, releasing energy and 2-3 more neutrons. If those new neutrons hit more nuclei, you get a chain reaction. But the physics is complicated. Neutrons scatter, get absorbed, escape the core, or cause fission, all with probabilities that depend on position, energy, and velocity.
Trillions of neutrons. Trillions of possible outcomes. Analytically unsolvable.
Ulam shared his solitaire insight with John von Neumann: what if we just simulate it? Generate random outcomes, over and over, and use the statistics to approximate the answer?
Von Neumann immediately saw the power. But he also spotted the problem. Solitaire games are independent. Neutrons aren't. A neutron's behavior depends on where it is and what it's done before.
You need a Markov chain.
So they built one. A neutron starts traveling through the core. At each step:
- It can scatter off an atom and keep traveling (arrow back to itself)
- It can escape or get absorbed (chain ends)
- It can hit U-235 and cause fission, releasing 2-3 new neutrons that start their own chains
The transition probabilities depend on the neutron's current state: its position, velocity, energy. Memoryless. Only the present matters.
Run this chain hundreds of times. Track how many neutrons are produced per initial neutron. That gives you k, the multiplication factor.
If k < 1: reaction dies. If k = 1: sustained but stable. If k > 1: exponential growth. Bomb.
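A toy version of that simulation fits in a few lines. The probabilities below are illustrative placeholders, not real U-235 cross-sections (in the real calculation they depend on the neutron's position, energy, and velocity):

```python
import random

# Illustrative, made-up probabilities for one neutron step
P_SCATTER, P_LOST, P_FISSION = 0.50, 0.30, 0.20  # lost = escaped or absorbed

def neutrons_from_one(rng):
    """Follow a single neutron's chain until it ends."""
    while True:
        r = rng.random()
        if r < P_SCATTER:
            continue                  # scatter: transition back to itself
        if r < P_SCATTER + P_LOST:
            return 0                  # escaped or absorbed: chain ends
        return rng.choice([2, 3])     # fission: 2-3 new neutrons

rng = random.Random(0)
trials = 100_000
k = sum(neutrons_from_one(rng) for _ in range(trials)) / trials
print(f"k ~ {k:.2f}")
```

With these made-up numbers, a neutron eventually causes fission with probability 0.2 / (0.3 + 0.2) = 0.4 and yields 2.5 neutrons on average, so k ≈ 1.0: right at criticality.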
As the uranium mass increases, the distribution of k shifts from subcritical to supercritical. That transition point is the critical mass. The minimum amount of fissile material needed for a bomb.
Ulam named the method after his gambling uncle's favorite casino. Monte Carlo.
The Monte Carlo method. Born from solitaire, powered by Markov chains, used to build nuclear weapons. And today it's used in everything from financial modeling to protein folding to climate simulation.
Ulam later wrote: "It is still an unending source of surprise for me to see how a few scribbles on a blackboard could change the course of human affairs."
the search engine
Fast forward to the 1990s. The internet explodes. Thousands of pages appear daily. And nobody can find anything.
Early search engines like Yahoo and Lycos ranked pages by counting how many times your search term appeared on the page. Simple keyword matching.
The problem? It was trivially easy to game. Hide keywords in white text on a white background. Repeat them a thousand times. Boom, top ranking.
There was no notion of quality. Only relevance.
Two Stanford PhD students, Sergey Brin and Larry Page, realized that links between pages contained information about quality. If a page has many other pages linking to it, it's probably important. And if those linking pages are themselves important, the endorsement means even more.
They modeled the entire web as a Markov chain.
Each webpage is a state. Each link is a transition. If Amy's page links only to Ben, there's a 100% chance of transitioning from Amy to Ben. If Ben links to three pages, each gets a 33% chance.
Now imagine a "random surfer" starting on a random page. They follow links randomly. Over thousands of steps, the fraction of time spent on each page converges to a stable distribution. Pages where the surfer spends more time are ranked higher.
But there's a problem. Some parts of the web aren't connected. The surfer can get stuck in a loop, bouncing between a small cluster of pages forever, never reaching the rest.
So they added a damping factor. 85% of the time, follow a link. 15% of the time, jump to a completely random page. This ensures the chain is ergodic. Every state is reachable from every other state. Convergence is guaranteed.
There are two ways to compute the ranking. Power iteration computes the exact stationary distribution; simulating the random surfer approximates it. They converge to the same ranking.
And in a small example graph, Ben consistently ranks highest even though every node starts with equal score. That's because Ben has the most incoming links from pages that themselves have importance. Quality links matter more than quantity.
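Here's a sketch of power iteration on a tiny graph. Only the Amy → Ben link comes from the text above; the other pages (Cal and Dee) and their links are invented for illustration:

```python
# Hypothetical four-page web; every page has at least one outlink
links = {
    "Amy": ["Ben"],
    "Ben": ["Amy"],
    "Cal": ["Amy", "Ben"],
    "Dee": ["Ben"],
}
d = 0.85                                   # damping factor
N = len(links)

rank = {page: 1 / N for page in links}     # every node starts equal
for _ in range(100):                       # power iteration
    new = {page: (1 - d) / N for page in links}      # random-jump share
    for page, outlinks in links.items():
        share = d * rank[page] / len(outlinks)
        for target in outlinks:
            new[target] += share           # follow-a-link share
    rank = new

print({page: round(score, 3) for page, score in rank.items()})
```

Ben comes out on top (roughly 0.47): three of the four pages link to him, including the relatively important Amy, while Cal and Dee, with no incoming links at all, get only the random-jump floor.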
They called the algorithm PageRank. ("Page" for web pages, and also because Larry's last name was Page. He snuck that in.)
Initially they called the search engine BackRub, after the backlinks it analyzed. Then they decided they needed a name as big as their ambition. They thought of the largest number they knew. 10¹⁰⁰. A googol.
When registering the domain, they misspelled it.
Google was born.
Today, Alphabet (Google's parent company) is worth over $2 trillion. And at the heart of it is a Markov chain that only looks at the current state to predict what happens next.
from letters to language models
Remember how Markov started all this? Predicting vowels and consonants in a poem.
In the 1940s, Claude Shannon, the father of information theory, took this idea further. Instead of vowels and consonants, what if you used individual letters as states? And instead of looking at just the last letter, what if you looked at the last two? Or three?
With a first-order model (predict based on the previous letter), you get gibberish. With a second-order model, you start seeing recognizable fragments. "th", "the", "ing". With higher orders, the generated text starts looking almost sensible.
Then Shannon asked: what if instead of letters, you use words as states?
The generated text got much better. Sequences of three or four words often made sense. "Attack on an English writer" is grammatically valid. The predictions improved because each step carried more context.
Shannon had discovered something fundamental. The more context you include, the better your predictions become.
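A first-order word-level chain takes only a few lines to build. The training text here is an arbitrary toy corpus, not Shannon's actual data:

```python
import random
from collections import defaultdict

corpus = ("the head of the house saw the attack on an english writer "
          "and the writer saw the house and the attack")
words = corpus.split()

# Transition table: word -> list of observed successors
# (duplicates in the list act as weights when sampling)
chain = defaultdict(list)
for prev, nxt in zip(words, words[1:]):
    chain[prev].append(nxt)

rng = random.Random(3)
word = "the"
generated = [word]
for _ in range(12):
    word = rng.choice(chain[word] or words)   # fall back if no successor
    generated.append(word)
print(" ".join(generated))
```

The output is locally plausible (every two-word window occurred in the corpus) but globally aimless: exactly the behavior Shannon observed.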
This is exactly what modern language models do. At the most abstract level, a large language model is answering the same question Markov asked in 1906:
Given what came before, what comes next?
The simplest version of this is a first-order word-level Markov chain: each word is predicted based only on the previous word. It produces text that's sometimes grammatical but rarely meaningful.
Modern LLMs do the same thing, but at a completely different scale. Instead of looking at one previous word, they look at thousands of previous tokens (words, parts of words, punctuation). Instead of storing transition probabilities in a lookup table, they learn them as billions of neural network parameters.
And the key innovation, attention, is essentially a mechanism for deciding which of those thousands of previous tokens matter most for the current prediction. In the phrase "the structure of the cell", attention lets the model look back at "blood" and "mitochondria" from earlier in the context to decide that "cell" means biology, not prison.
But strip away the transformers, the attention heads, the billions of parameters, the RLHF, the prompt engineering. The fundamental question is still:
P(next token | all previous tokens)
It's Markov's question. Extended. Scaled up. Made deep. But the same question.
the memoryless property is the point
Here's what still gets me about all of this.
Weather systems have histories stretching back millions of years. Text has context going back entire books. The web has billions of interconnected pages with link structures built over decades. Nuclear reactions involve trillions of particles with complex interaction histories.
And for many of these systems, you can ignore almost all of that history. You can just look at the current state and make a meaningful prediction about what happens next.
That's not a hack. That's not an approximation born from laziness. It's a deep structural property of many real-world systems.
The memoryless property is what makes these systems computable. Without it, you'd need to track the entire history of every state to make a prediction. With it, you just need to know where you are right now.
As one paper put it: "Problem-solving is often a matter of cooking up an appropriate Markov chain."
And it all started because one furious Russian mathematician wanted to prove another one wrong.
the math for the nerds who stayed
Here's the formal machinery.
Definition. A (discrete-time, discrete-state) Markov chain is a sequence of random variables X₀, X₁, X₂, ... satisfying:
P(Xₙ₊₁ = j | Xₙ = i, Xₙ₋₁ = iₙ₋₁, ..., X₀ = i₀) = P(Xₙ₊₁ = j | Xₙ = i) = Pᵢⱼ
The transition matrix P has entries Pᵢⱼ where each row sums to 1 (it's a stochastic matrix).
Stationary distribution. A probability vector π is stationary if:
π = πP
Meaning: if the system is in distribution π, one step later it's still in distribution π. The system has "settled."
For the Markov chain we built from Eugene Onegin:
P = | 0.14 0.86 |
    | 0.65 0.35 |
(rows are the current state, V then C; columns are the next state, V then C)
Solving πP = π with π₁ + π₂ = 1:
π₁ = 0.65 / (0.86 + 0.65) ≈ 0.43
Exactly the vowel frequency Markov counted by hand. (With exact fractions it comes out to precisely 0.43: π₁ = (0.37/0.57) / (0.37/0.43 + 0.37/0.57) = 0.43.) The stationary distribution of the transition matrix reproduces the marginal letter frequencies of the poem, which is what convergence means here: run the chain long enough and the time-average fraction of vowels approaches 43%, no matter where you start.
The key theorem is:
Ergodic theorem for Markov chains. If a Markov chain is irreducible (every state reachable from every other) and aperiodic (doesn't cycle with a fixed period), then:
- A unique stationary distribution π exists
- For any starting state, the distribution converges to π
- The time-average fraction spent in state j converges to πⱼ
This is the generalized law of large numbers for dependent events. Markov's core result.
PageRank formulation. For a web graph with N pages, the transition matrix is:
M = d × A + (1 - d) × (1/N) × J
Where:
- A is the link-based transition matrix (Aᵢⱼ = 1/outlinks(i) if page i links to page j)
- d = 0.85 is the damping factor
- J is the N×N all-ones matrix
- The (1-d) term is the "random jump" that ensures ergodicity
The PageRank vector is the stationary distribution of M:
π = πM
Monte Carlo estimation. For any function f of the state, the Monte Carlo estimate after T steps is:
f̂ = (1/T) Σₜ f(Xₜ)
By the ergodic theorem, f̂ → E_π[f] as T → ∞. The estimation error decreases as O(1/√T), which means doubling accuracy requires 4x the samples. This is dimension-independent, which is why Monte Carlo works well in high-dimensional spaces where grid-based methods fail exponentially.
Connection to LLMs. A language model defines:
P(xₜ | x₁, x₂, ..., xₜ₋₁)
This is not Markov in the strict sense because it conditions on the entire history, not just the current state. But the autoregressive generation process (sample one token, append it, sample the next) is a Markov process in the space of sequences. The "state" is the entire sequence so far. Each generation step depends only on the current sequence, not on which random samples led to it. The Markov property holds at the level of the generation trajectory.
The chain is: generate token → update context → generate token → update context → ...
Each step is memoryless with respect to the process of generation. The model doesn't remember which candidates it considered and rejected. Only the sequence itself persists.
So even modern LLMs, in a deep structural sense, are running a Markov chain. Just one with an astronomically large state space.
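The trajectory view is easy to make concrete. Here `next_token_probs` is a toy stand-in for a real model; the point is only the shape of the loop, where the next distribution depends on the current sequence alone:

```python
import random

def next_token_probs(state):
    # Toy stand-in for a language model: any function of the current
    # sequence works, because the state carries all the context.
    return {"a": 0.8, "b": 0.2} if state[-1] == "b" else {"a": 0.2, "b": 0.8}

rng = random.Random(0)
state = ("a",)                     # the state is the whole sequence so far
for _ in range(8):
    probs = next_token_probs(state)              # depends only on the state
    token = rng.choices(list(probs), weights=list(probs.values()))[0]
    state = state + (token,)                     # transition in sequence space
print("".join(state))
```

Nothing about which candidates were rejected along the way survives to the next step. Only the sequence does.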
Markov would probably be indifferent about this. He was concerned only with pure analysis, after all.
But I think it's worth saying: a fight about free will in Tsarist Russia led to the mathematical foundation under Google, nuclear physics, Monte Carlo simulation, and every large language model on the planet.
That's a pretty good chain of events.
~ Ashish Kumar Verma