A from-scratch inference engine
Learn how LLM inference works — by building a tiny inference engine for Apple Silicon, and writing down what we figure out along the way.
Where we are
1 / 7 core milestones done · currently M1 — load the weights
The one-sentence version
Inference is: turn text into numbers, push those numbers through a big pile of matrix multiplications the model learned during training, and turn the result back into text — one token at a time, in a loop.
Everything else — KV caches, quantization, Metal kernels, MoE routing — is just making that loop correct, then fast, then small enough to fit. This site is the distilled map; the repo is where we build it.
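As a rough sketch of that loop, here is a greedy generator in Rust. `toy_forward` is a hypothetical stand-in for the real model (it just favors the token after the last one), and `generate`, `argmax`, and all parameters are names made up for illustration, not the repo's API. Tokenizing and detokenizing are left out so the loop itself stays visible.

```rust
/// Hypothetical stand-in for a real model: returns one score (logit) per
/// vocabulary entry, given the tokens seen so far. Assumes vocab_size > 0.
fn toy_forward(tokens: &[u32], vocab_size: usize) -> Vec<f32> {
    // Toy rule: favor the token after the last one, wrapping around.
    let last = *tokens.last().unwrap_or(&0) as usize;
    let mut logits = vec![0.0; vocab_size];
    logits[(last + 1) % vocab_size] = 1.0;
    logits
}

/// Greedy pick: index of the largest logit.
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &x) in logits.iter().enumerate() {
        if x > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// The loop itself: forward pass, pick one token, append, repeat.
fn generate(prompt: &[u32], vocab_size: usize, stop_token: u32, max_new: usize) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new {
        let logits = toy_forward(&tokens, vocab_size);
        let next = argmax(&logits);
        if next == stop_token {
            break;
        }
        tokens.push(next);
    }
    tokens
}
```

Calling `generate(&[1], 5, 4, 10)` walks 1 → 2 → 3 and stops when the toy model proposes the stop token 4. A real engine swaps `toy_forward` for the actual forward pass and `argmax` for a sampler; the shape of the loop stays the same.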
The whole machine at a glance
An inference engine is a stack, from a chat loop at the top down to GPU threads at the bottom. The rungs below run top to bottom, each with what it is and why it matters. Stop descending wherever it stops being interesting to you.
The outermost shell: a conversation is just a growing list of tokens. Out of scope for now — we want the engine underneath it first.
The loop that makes the model talk: run the forward pass, sample one token, append it, repeat until a stop token. This is where prefill and decode live (see below).
One full pass through the network turns a sequence of tokens into a score (logit) for every possible next token. Mostly matmuls.
The repeated unit. Attention lets each token mix in information from earlier tokens; the FFN transforms each token on its own. Stack it N times and you have the whole model body.
The actual math. Each named thing — RMSNorm, RoPE, attention, SwiGLU — is one tensor operation. This is the layer we hand-write.
Each tensor op becomes a small MSL program the GPU runs over thousands of threads. We submit them through raw Objective-C FFI — no wrapper crate — so nothing is hidden.
Weights are just numbers laid out in memory. How many bits each takes (quantization) decides whether the model fits — and, because decode is bandwidth-bound, how fast it runs.
The floor. We don’t build the GPU, but understanding its memory hierarchy is what makes the kernels above it fast.
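To make the weights rung concrete, here is a minimal sketch of symmetric 8-bit quantization over one block of weights, assuming one shared `f32` scale per block. `QuantBlock`, `quantize_block`, and `dequantize_block` are hypothetical names, and this layout is an illustration only; it is not the GGUF format and not necessarily what the repo ends up using.

```rust
/// One quantized block: a shared scale plus one signed byte per weight.
struct QuantBlock {
    scale: f32,
    values: Vec<i8>,
}

/// Symmetric 8-bit quantization of one block of weights.
fn quantize_block(weights: &[f32]) -> QuantBlock {
    let max_abs = weights.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
    // Map the largest magnitude to 127; an all-zero block keeps scale 1.0.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let values = weights
        .iter()
        .map(|&w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    QuantBlock { scale, values }
}

/// Recover approximate weights: each stored byte times the block's scale.
fn dequantize_block(block: &QuantBlock) -> Vec<f32> {
    block.values.iter().map(|&q| q as f32 * block.scale).collect()
}
```

Each weight now costs 8 bits plus its share of the block's scale, and the round trip is off by at most about half a scale step per weight. Fewer bits per weight means fewer bytes to stream on every decode step, which is where the speedup comes from.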
Follow one prompt through the machine
The same story as the ladder, told as a pipeline. Each stage names the milestone where we build it.
Split text into integer token IDs with a learned BPE vocabulary. "hello world" → [15339, 1917].
Each token ID indexes a row of the embedding matrix — now the model works on numbers, not text.
Vectors flow through N identical blocks: norm → attention → norm → FFN, each with a residual add.
A last norm, then a big matmul against the vocabulary gives a score for every possible next token.
Softmax with temperature turns scores into probabilities; greedy or top-k/top-p picks one token.
Add the new token and loop — but only for the one new token, thanks to the KV cache. Stream to screen.
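The sampling stage can be sketched in a few functions. This is a minimal illustration, not the engine's sampler: `softmax_with_temperature`, `top_k`, and `sample` are hypothetical names, top-p is left out, and the random draw `u` is passed in by the caller because the Rust standard library has no random number generator.

```rust
/// Softmax with temperature: probabilities proportional to exp(logit / T).
/// Assumes temperature > 0.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the max before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Keep only the k most probable tokens and renormalize (top-k filtering).
/// Returns (token index, probability) pairs, most probable first.
fn top_k(probs: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = probs.iter().cloned().enumerate().collect();
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k);
    let total: f32 = indexed.iter().map(|p| p.1).sum();
    indexed.into_iter().map(|(i, p)| (i, p / total)).collect()
}

/// Pick a token from the candidates using a uniform draw u in [0, 1).
fn sample(candidates: &[(usize, f32)], u: f32) -> usize {
    let mut cumulative = 0.0;
    for &(index, p) in candidates {
        cumulative += p;
        if u < cumulative {
            return index;
        }
    }
    // Fall back to the last candidate if rounding left a tiny gap below 1.0.
    candidates.last().map(|c| c.0).unwrap_or(0)
}
```

Greedy decoding is the special case `top_k(&probs, 1)`. Lowering the temperature sharpens the distribution toward the highest logit; raising it flattens the distribution and makes unlikely tokens more probable.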
Why inference feels the way it does
Internalizing the split between prefill and decode explains almost every performance decision in the engine.
Prefill: process the whole prompt at once. Many tokens → big matmuls → compute-bound. Fast per token: the “reading your question” phase.
~463 tok/s
Decode: generate one token at a time. Each step touches all the weights to produce one token → memory-bandwidth-bound. The slow, one-word-at-a-time “typing the answer” phase.
~26 tok/s
Headline ds4 numbers on an M5 Max (q2): same model, ~18× difference, because decode is gated by memory bandwidth, not math.
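The bandwidth argument can be turned into back-of-envelope arithmetic. The sketch below assumes every decode step reads all active weights from memory exactly once and ignores KV-cache reads, activations, and compute time, so it gives a ceiling rather than a prediction. The function names and parameters are hypothetical, and the inputs are for you to fill in; none of them are measured hardware figures.

```rust
/// Rough upper bound on decode speed when every generated token must read
/// all active weights from memory once.
fn decode_tokens_per_sec_bound(
    bandwidth_gb_per_s: f64,
    active_params_billions: f64,
    bits_per_weight: f64,
) -> f64 {
    // Billions of parameters times bytes per parameter gives gigabytes per token.
    let gb_per_token = active_params_billions * bits_per_weight / 8.0;
    bandwidth_gb_per_s / gb_per_token
}

/// Ratio between two throughput figures, e.g. prefill vs decode.
fn prefill_to_decode_ratio(prefill_tok_s: f64, decode_tok_s: f64) -> f64 {
    prefill_tok_s / decode_tok_s
}
```

With the two headline figures, `prefill_to_decode_ratio(463.0, 26.0)` comes out just under 18. The bound also shows why quantization speeds up decode: halving `bits_per_weight` halves the bytes per token and doubles the ceiling.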
What makes it practical
Once the loop is correct, almost all of inference engineering is three ideas.
Don’t recompute the past. Cache each layer’s keys/values so decode does one token’s worth of work instead of reprocessing the whole sequence.
Store 16-bit weights in 8/4/2 bits, typically with a scale factor shared by each small block of weights so the original range can be recovered. Shrinks memory and, because decode is bandwidth-bound, speeds it up.
Do the math on the GPU, tightly. Each op becomes an MSL shader we submit via raw FFI; kernel fusion is the key speed lever.
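The first idea is the easiest to see in code. Below is a minimal sketch of a KV cache for a single attention head in a single layer, on the CPU, with no RoPE and no batching. `KvCache` and `decode_step` are hypothetical names for illustration, not the engine's types.

```rust
/// Keys and values for one attention head in one layer, one entry per past token.
struct KvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl KvCache {
    fn new() -> Self {
        KvCache { keys: Vec::new(), values: Vec::new() }
    }

    /// One decode step: store the new token's key/value, then attend over
    /// everything cached so far using only the new token's query.
    fn decode_step(&mut self, query: &[f32], key: Vec<f32>, value: Vec<f32>) -> Vec<f32> {
        self.keys.push(key);
        self.values.push(value);

        // Scaled dot-product scores against every cached key.
        let scale = 1.0 / (query.len() as f32).sqrt();
        let scores: Vec<f32> = self
            .keys
            .iter()
            .map(|k| k.iter().zip(query).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();

        // Softmax over the scores (max subtracted for stability).
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let total: f32 = exps.iter().sum();

        // Output is the probability-weighted sum of cached values.
        let mut out = vec![0.0; self.values[0].len()];
        for (e, v) in exps.iter().zip(&self.values) {
            for (o, x) in out.iter_mut().zip(v) {
                *o += (e / total) * x;
            }
        }
        out
    }
}
```

Each call computes a key, value, and query for the new token only; keys and values for earlier tokens are read back from the cache instead of being recomputed. The price is memory that grows with sequence length, once per layer and per head.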
Who covers what
| Layer | IE (Kiely) | ds4 | Failed Star builds |
|---|---|---|---|
| Concepts / vocabulary | ✅ everything | — | — (we cite it) |
| Tokenizer | mentions | ds4.c (hash table) | M0 (Rust) |
| Forward pass | ✅ the math | ds4.c + metal/* | M2 (Rust + MSL) |
| Sampling | ✅ | softmax, argsort | M3 (Rust) |
| KV cache | ✅ §5.3 | ds4_kvstore.c + SSD | M4 (RAM-first) |
| Quantization | ✅ §5.1 | gguf-tools/ | M5 (8/4-bit) |
| Metal kernels | ✅ §4.1 | ds4_metal.m | M6 (raw FFI) |
| Multi-backend (CUDA/ROCm) | Ch 3–4 | ds4_cuda.cu | ❌ Metal only |
| Distributed / server / agent | Ch 7 | ds4_server.c | ❌ out of scope |
ds4 runs DeepSeek-V4-Flash (284B / 13B-active MoE). Failed Star starts with a tiny dense model, so the comparison is 1:1 for fundamentals and divergent for the fancy parts, which is exactly why those are late milestones.