Diagrams

Interactive previews

Live, in-browser diagrams. Each one previews a concept the engine will implement.

Milestone M0 · the first and last thing every prompt touches

Tokenizer: text becomes numbers

Before the model sees anything, text is split into tokens, each mapped to an integer ID from a learned vocabulary: common chunks become one token, rare ones split into pieces. Type below and watch it tokenize. This is a toy byte-pair encoder with a tiny merge table; the real M0 tokenizer uses a vocabulary built from tens of thousands of learned merges.

your text

What you’re seeing

Each chip is one token; the small number is its ID — an index into the vocabulary. The model only ever sees these IDs, never the letters.
Spaces are tokens too (the dim ␣ chips). Whitespace and casing change the tokens, which is why they matter.
Byte-pair encoding starts from single characters and keeps merging the most common adjacent pair, so frequent words collapse to one token while unusual strings stay in pieces.
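
The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the diagram's code and not ds4's tokenizer: the function names (train_bpe, apply_merge, build_vocab, encode), the character-level starting point, and the -1 ID for unknown symbols are all assumptions made for the example.

```python
from collections import Counter

UNK_ID = -1  # assumption: a real byte-level tokenizer falls back to bytes instead


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn a merge table by repeatedly fusing the most common adjacent pair."""
    sequences = [list(text) for text in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in sequences:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        sequences = [apply_merge(s, best) for s in sequences]
    return merges


def build_vocab(corpus, merges):
    """Assign IDs: single characters first (sorted), then merged tokens."""
    vocab = {}
    chars = sorted({c for text in corpus for c in text})
    for token in chars + [a + b for a, b in merges]:
        vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, merges, vocab):
    """Split into characters, replay merges in learned order, look up IDs."""
    symbols = list(text)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols, [vocab.get(s, UNK_ID) for s in symbols]


corpus = ["the the the cat"]
merges = train_bpe(corpus, num_merges=2)   # [('t', 'h'), ('th', 'e')]
vocab = build_vocab(corpus, merges)
tokens, ids = encode("the tea", merges, vocab)
print(tokens)  # ['the', ' ', 't', 'e', 'a']
print(ids)     # [7, 0, 5, 3, 1]
```

With only two merges, the frequent string "the" already collapses to one token while the unseen "tea" stays in single-character pieces, and the space gets its own ID.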

📖 Inference Engineering (Kiely) · §2.2 (p.46) 🔧 ds4 · ds4.c (hash-table vocab)

Milestone M3 · the last step of every token

From logits to a token: sampling

The model’s forward pass ends with a logit (a raw score) for every token in the vocabulary. To pick the next token we turn those scores into probabilities with softmax, then choose. Three knobs shape that choice: drag them and watch the distribution move. The math here is the real softmax; only the logits are a fixed toy example.

The cat sat on the ___

temperature = 1.00

top-k = 8

top-p = 1.00

What each knob does

Temperature divides the logits before softmax. Low (→0) sharpens toward the single top token (greedy); high (→2) flattens the distribution so unlikely tokens get a real chance. Bars above scale with the post-temperature probability, so you can watch them even out as you turn it up.
Top-k keeps only the k highest-probability tokens and discards the rest (dimmed). A hard cap on how many candidates survive.
Top-p (nucleus) keeps the smallest set of top-ranked tokens whose probabilities add up to at least p: an adaptive cutoff that keeps more candidates when the model is unsure, fewer when it’s confident.

The survivors are renormalized to sum to 100%, and “Sample a token” draws from that final distribution, exactly as the generation loop will, one token at a time.
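
The three knobs and the final draw can be sketched as follows. This is a minimal sketch under stated assumptions, not ds4's sampler: the function names are hypothetical, top-k is applied before top-p (implementations differ on the order), top-p is measured on the pre-filter probabilities, and a temperature of 0 is treated as greedy to avoid dividing by zero.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Divide logits by temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Return {token index: renormalized probability} for the survivors."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]          # hard cap on the number of candidates
    kept, cumulative = [], 0.0
    for i in ranked:                 # smallest prefix whose mass reaches top_p
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample(logits, temperature=1.0, top_k=8, top_p=1.0, rng=random):
    """Pick one token index from the filtered, renormalized distribution."""
    if temperature <= 0:             # assumption: treat 0 as greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    dist = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


toy_logits = [2.0, 1.0, 0.0]         # made-up scores for three candidate tokens
print(softmax(toy_logits, temperature=0.5))
print(softmax(toy_logits, temperature=2.0))
print(sample(toy_logits, temperature=1.0, top_k=2, top_p=0.9))
```

Passing a seeded random.Random instance as rng makes the draw reproducible, which helps when comparing knob settings.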

📖 Inference Engineering (Kiely) · §2.2 (p.46) 🔧 ds4 · softmax.metal, argsort.metal

Milestone M2 · the heart of the transformer

Attention: every token looks back

Inside each transformer block, a token builds its next representation by looking at itself and the tokens before it and taking a weighted average of their value vectors. The weights come from how well the token’s query matches each token’s key (Q·Kᵀ → softmax). Pick a token to see what it attends to. The softmax is real; the token vectors are a fixed toy example.

query token →

causal mask (no peeking ahead)

What you’re seeing

The chosen query token is compared against every token’s key; softmax turns those scores into weights that sum to 100% — shown as the bar under each chip and in the list.
The token’s output is the weighted sum of the value vectors: it has now mixed in information from whatever it attended to.
Turn off the causal mask to let a token look at later words. During generation we keep it on — token t can’t depend on words that don’t exist yet.
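
For a single query token, the steps above can be sketched like this. It is an illustrative sketch, not the diagram's code or ds4's kernel: the attend function and the toy vectors are made up, and the scores are divided by √d as in standard scaled dot-product attention, which the diagram may or may not do.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q_index, queries, keys, values, causal=True):
    """Attention output for one query token, with an optional causal mask."""
    q = queries[q_index]
    scale = math.sqrt(len(q))  # assumption: standard 1/sqrt(d) scaling
    # With the causal mask on, only positions 0..q_index are visible.
    visible = q_index + 1 if causal else len(keys)
    scores = [
        sum(a * b for a, b in zip(q, keys[j])) / scale for j in range(visible)
    ]
    weights = softmax(scores)
    weights += [0.0] * (len(keys) - visible)  # masked positions get zero weight
    # The output is the weighted sum of the value vectors.
    width = len(values[0])
    output = [
        sum(w * values[j][c] for j, w in enumerate(weights)) for c in range(width)
    ]
    return weights, output


# Toy 2-dimensional vectors for three tokens (made up for illustration).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

weights, output = attend(1, queries, keys, values, causal=True)
print(weights)  # the third weight is 0.0: token 1 cannot see token 2
print(output)
```

Calling attend with causal=False for the same query spreads weight onto the later token as well, which mirrors turning the mask off in the diagram.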

📖 Inference Engineering (Kiely) · §2.2.3 (p.52) 🔧 ds4 · flash_attn.metal, softmax.metal

Milestone M4 · why decode isn’t quadratic

The KV cache: don’t recompute the past

To produce the next token, attention needs the keys and values of every earlier token. Recomputing them at every step would be quadratic waste — so we cache each token’s K/V once and reuse it. Step through generation and watch the difference.

use the KV cache

What you’re seeing

Each cell is one token’s cached K/V. The glowing cells are the ones actually computed this step.
With the cache on, each decode step computes K/V for just one new token. With it off, every step recomputes the whole sequence.
Watch the totals diverge: the total number of K/V computations grows linearly with the cache and quadratically without it. That gap is what keeps long generations practical.
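
The totals can be reproduced with a small counting sketch. It only counts how many tokens have their K/V computed at each step, and it assumes the first step processes the whole prompt; the function name and parameters are hypothetical and this is not how ds4's KV store is implemented.

```python
def kv_computations(prompt_len, new_tokens, use_cache):
    """Count per-token K/V computations across a generation run.

    Returns (per_step, total), where per_step[i] is the number of tokens
    whose K/V had to be computed to produce the i-th new token.
    """
    per_step = []
    cached = 0  # how many tokens already have K/V stored
    for step in range(new_tokens):
        seq_len = prompt_len + step  # tokens attention must see at this step
        if use_cache:
            computed = seq_len - cached  # only tokens not yet in the cache
            cached = seq_len
        else:
            computed = seq_len           # recompute the whole sequence
        per_step.append(computed)
    return per_step, sum(per_step)


print(kv_computations(prompt_len=4, new_tokens=5, use_cache=True))
# ([4, 1, 1, 1, 1], 8)
print(kv_computations(prompt_len=4, new_tokens=5, use_cache=False))
# ([4, 5, 6, 7, 8], 30)
```

For a prompt of p tokens and n generated tokens, the cached total is p + n - 1 while the uncached total is n·p + n(n-1)/2, which is where the quadratic growth comes from.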

📖 Inference Engineering (Kiely) · §5.3 Caching (p.136) 🔧 ds4 · ds4_kvstore.c, dsv4_kv.metal

On the way

Diagrams we want to build next

RoPE — how a token’s position gets baked in by rotating its query and key vectors.
RMSNorm & the residual stream — what keeps a deep stack of blocks numerically stable.
Quantization — packing 16-bit weights into 4 bits, and what it costs (a preview of M5).

These arrive as their milestones land — see the abstraction ladder for where each one sits.