Interactive previews
Live, in-browser diagrams. Each one previews a concept the engine will implement.
Milestone M0 · the first and last thing every prompt touches
Before the model sees anything, text is split into integer token IDs from a learned vocabulary — common chunks become one token, rare ones split into pieces. Type below and watch it tokenize. This is a toy byte-pair encoder with a tiny merge table; the real M0 tokenizer uses a learned vocabulary of tens of thousands of merges.
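As a rough sketch of what a toy byte-pair encoder with a tiny merge table might look like, the snippet below starts from raw UTF-8 bytes and applies each merge in table order. The merge table `MERGES`, the ID scheme (single bytes keep their byte value, merges get 256 plus their rank), and the function names are all assumptions made for illustration; they are not the tables or API of the preview or of the M0 tokenizer.

```python
# Hypothetical merge table, highest priority first. Not the real M0 merges.
MERGES = [(b"t", b"h"), (b"th", b"e"), (b" ", b"the"), (b"a", b"t")]

# Assumed ID scheme: single bytes use their value (0-255), merges get 256 + rank.
MERGE_IDS = {left + right: 256 + rank for rank, (left, right) in enumerate(MERGES)}


def bpe_encode(text):
    """Split text into byte pieces, then apply each merge in table order."""
    pieces = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in MERGES:
        merged, i = [], 0
        while i < len(pieces):
            # Join an adjacent (left, right) pair into one piece.
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


def to_ids(pieces):
    """Map each piece to an integer token ID."""
    return [MERGE_IDS[p] if len(p) > 1 else p[0] for p in pieces]
```

With this table, `bpe_encode("the cat")` gives `[b"the", b" ", b"c", b"at"]`, while `bpe_encode("The")` stays as three single-byte pieces because no merge starts with an uppercase `T`, and `bpe_encode(" the")` collapses into the single piece `b" the"`.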
Spaces are shown as ␣ chips. Whitespace and casing change the tokens, which is why they matter.
Milestone M3 · the last step of every token
The model’s forward pass ends with a logit (a raw score) for every token in the vocabulary. To pick the next token we turn those scores into probabilities with softmax, then choose. Three knobs shape that choice: drag them and watch the distribution move. The math here is the real softmax; only the logits are a fixed toy example.
The cat sat on the ___
The tokens that survive filtering are renormalized to sum to 100%, and “Sample a token” draws from that final distribution, which is exactly what the generation loop will do, one token at a time.
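The pipeline of softmax, filtering, renormalizing, and sampling can be sketched in a few lines. This is a minimal illustration under assumptions: the text does not name the three knobs, so temperature, top-k, and top-p are assumed here as the common choices; the logits in the usage note are invented; and applying top-k before top-p is one ordering among several that implementations use.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_and_renormalize(probs, top_k=None, top_p=None):
    """Keep the top-k and/or top-p tokens, then rescale survivors to sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest prefix reaching the threshold
                break
        order = kept
    total = sum(probs[i] for i in order)
    filtered = [0.0] * len(probs)
    for i in order:
        filtered[i] = probs[i] / total
    return filtered


def sample(probs, rng=random):
    """Draw one token index from the final distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

For example, with hypothetical logits `[2.0, 1.0, 0.5, -1.0]`, calling `filter_and_renormalize(softmax(logits), top_k=2)` leaves only the two highest-scoring tokens with nonzero probability, and `sample` can then only return index 0 or 1.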
Milestone M2 · the heart of the transformer
Inside each transformer block, a token builds its next representation by looking at the tokens before it and taking a weighted average of them. The weights come from how well a token’s query matches each other token’s key (Q·Kᵀ → softmax). Pick a token to see what it attends to. The softmax is real; the token vectors are a fixed toy example.
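For one chosen token, the weighted average described above can be sketched as follows. This is an illustration under assumptions: the vectors in the usage note are made up, the token is allowed to attend to itself as well as to earlier tokens, and the scores are divided by the square root of the vector size as in standard scaled dot-product attention, which the text does not state explicitly.

```python
import math


def attend(query_index, queries, keys, values):
    """Return (attention weights, output vector) for one token position."""
    q = queries[query_index]
    dim = len(q)
    # Causal mask: only positions 0..query_index are visible.
    scores = [
        sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
        for j in range(query_index + 1)
    ]
    # Softmax over the visible positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the weighted average of the visible value vectors.
    output = [
        sum(w * values[j][c] for j, w in enumerate(weights))
        for c in range(len(values[0]))
    ]
    return weights, output
```

With toy vectors such as `queries = keys = [[1, 0], [0, 1], [1, 1]]`, the third token scores highest against its own key, so it puts its largest weight on itself and equal, smaller weights on the two earlier tokens. The first token has nothing before it, so its weight on itself is 1.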
Milestone M4 · why decode isn’t quadratic
To produce the next token, attention needs the keys and values of every earlier token. Recomputing them at every step would be quadratic waste — so we cache each token’s K/V once and reuse it. Step through generation and watch the difference.
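The saving can be made concrete by counting how many key/value projections each strategy performs. The sketch below is only an illustration: `project` is a hypothetical stand-in for the real K and V projections (here a plain scalar multiply), and the function names are invented for this example.

```python
def project(x, scale):
    """Hypothetical stand-in for a K or V projection of one token vector."""
    return [scale * v for v in x]


def decode_without_cache(embeddings):
    """Recompute K and V for every earlier token at every step."""
    projections = 0
    keys, values = [], []
    for step in range(1, len(embeddings) + 1):
        keys = [project(x, 2.0) for x in embeddings[:step]]
        values = [project(x, 3.0) for x in embeddings[:step]]
        projections += 2 * step  # one K and one V per visible token
    return projections, keys, values


def decode_with_cache(embeddings):
    """Compute each token's K and V once and append them to the cache."""
    projections = 0
    k_cache, v_cache = [], []
    for x in embeddings:
        k_cache.append(project(x, 2.0))
        v_cache.append(project(x, 3.0))
        projections += 2  # only the newest token is projected
    return projections, k_cache, v_cache
```

For 8 tokens, the uncached loop performs 2 · (1 + 2 + … + 8) = 72 projections while the cached loop performs 16, and both end with identical keys and values. In general the counts are n(n + 1) versus 2n.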
On the way
These arrive as their milestones land — see the abstraction ladder for where each one sits.