A from-scratch inference engine

Failed Star

Learn how LLM inference works — by building a tiny inference engine for Apple Silicon, and writing down what we figure out along the way.

A cartoon five-pointed star, hunched over and looking dejected — a failed star.

Where we are

Build progress

1 / 7 core milestones done · currently M1 — load the weights

  1. M0Tokenizer
  2. M1Load weights
  3. M2Forward pass
  4. M3Sampling
  5. M4KV cache
  6. M5Quantization
  7. M6Metal kernels
  8. 7+Stretch

The one-sentence version

It’s just text → numbers → text, in a loop

Inference is: turn text into numbers, push those numbers through a big pile of matrix multiplications the model learned during training, and turn the result back into text — one token at a time, in a loop.

Everything else — KV caches, quantization, Metal kernels, MoE routing — is just making that loop correct, then fast, then small enough to fit. This site is the distilled map; the repo is where we build it.

📖 Inference Engineering (Kiely) · Ch 0, Ch 2 🔧 ds4 · the reference engine

The whole machine at a glance

The abstraction ladder

An inference engine is a stack — from a chat loop at the top down to GPU threads at the bottom. Click any rung to see what it is, why it matters, and the milestone where we build it. Stop descending wherever it stops being interesting to you.

↑ highest abstraction · lowest ↓
7 Chat / agent loop
“user types, model replies, repeat” — prompt templating, tool calls, sessions
later / maybe

The outermost shell: a conversation is just a growing list of tokens. Out of scope for now — we want the engine underneath it first.

6 Generation loop
prefill → decode → sample → append; temperature, top-k/top-p, streaming
M3

The loop that makes the model talk: run the forward pass, sample one token, append it, repeat until a stop token. This is where prefill and decode live (see below).

5 The model forward pass
embeddings → N transformer blocks → final norm → logits
M2

One full pass through the network turns a sequence of tokens into a score (logit) for every possible next token. Mostly matmuls.

4 One transformer block
norm → attention (+RoPE, +KV cache) → norm → FFN/MoE → residual adds
M2

The repeated unit. Attention lets each token mix in information from earlier tokens; the FFN transforms each token on its own. Stack it N times and you have the whole model body.

3 Tensor operations
matmul, softmax, RMSNorm, RoPE, SwiGLU, argmax/sampling
M2–M3

The actual math. Each named thing — RMSNorm, RoPE, attention, SwiGLU — is one tensor operation. This is the layer we hand-write.

2 Kernels
each op as a Metal (MSL) shader, run on the GPU over a command buffer
M6

Each tensor op becomes a small MSL program the GPU runs over thousands of threads. We submit them through raw Objective-C FFI — no wrapper crate — so nothing is hidden.

1 Memory & numbers
weights in RAM, dtype/quantization, unified memory, buffers, bandwidth
M1 · M5

Weights are just numbers laid out in memory. How many bits each takes (quantization) decides whether the model fits — and, because decode is bandwidth-bound, how fast it runs.

0 The hardware
Apple Silicon GPU: threads, SIMD, threadgroups, registers, the memory hierarchy
we target it

The floor. We don’t build the GPU, but understanding its memory hierarchy is what makes the kernels above it fast.

How we build: roughly bottom-meets-middle — enough of rungs 1–3 to make rung 5 produce correct logits, then rung 6 to make it talk, then back down into rungs 2/1/0 to make it fast.

Follow one prompt through the machine

The data’s journey

The same story as the ladder, told as a pipeline. Each stage names the milestone where we build it.

M0STAGE A

Text → tokens

Split text into integer token IDs with a learned BPE vocabulary. "hello world"[15339, 1917].

M2STAGE B

Tokens → vectors

Each token ID indexes a row of the embedding matrix — now the model works on numbers, not text.

M2STAGE C

Transformer blocks

Vectors flow through N identical blocks: norm → attention → norm → FFN, each with a residual add.

M2STAGE D

Final norm → logits

A last norm, then a big matmul against the vocabulary gives a score for every possible next token.

M3STAGE E

Logits → next token

Softmax with temperature turns scores into probabilities; greedy or top-k/top-p picks one token.

M3–M4STAGE F

Append & repeat

Add the new token and loop — but only for the one new token, thanks to the KV cache. Stream to screen.

Why inference feels the way it does

Two phases: prefill vs decode

Internalizing this split explains almost every performance decision in the engine.

Prefill

Process the whole prompt at once. Many tokens → big matmuls → compute-bound. Fast per token — the “reading your question” phase.

~463 tok/s

Decode

Generate one token at a time. Each step touches all the weights to produce one token → memory-bandwidth-bound. The slow, one-word-at-a-time “typing the answer” phase.

~26 tok/s

Headline ds4 numbers on an M5 Max (q2): same model, ~18× difference — because decode is gated by memory bandwidth, not math.

📖 §2.4 Inference Bottlenecks (p.61)

What makes it practical

Correct → fast → small

Once the loop is correct, almost all of inference engineering is three ideas.

M4

KV cache

Don’t recompute the past. Cache each layer’s keys/values so decode does one token’s worth of work instead of reprocessing the whole sequence.

M5

Quantization

Store 16-bit weights in 8/4/2 bits with clever scaling. Shrinks memory and — because decode is bandwidth-bound — speeds it up.

M6

Metal kernels

Do the math on the GPU, tightly. Each op becomes an MSL shader we submit via raw FFI; kernel fusion is the key speed lever.

Who covers what

Failed Star vs Dwarf Star vs IE (Kiely)

LayerIE (Kiely)ds4Failed Star builds
Concepts / vocabulary✅ everything— (we cite it)
Tokenizermentionsds4.c (hash table)M0 (Rust)
Forward pass✅ the mathds4.c + metal/*M2 (Rust + MSL)
Samplingsoftmax, argsortM3 (Rust)
KV cache✅ §5.3ds4_kvstore.c + SSDM4 (RAM-first)
Quantization✅ §5.1gguf-tools/M5 (8/4-bit)
Metal kernels✅ §4.1ds4_metal.mM6 (raw FFI)
Multi-backend (CUDA/ROCm)Ch 3–4ds4_cuda.cu❌ Metal only
Distributed / server / agentCh 7ds4_server.c❌ out of scope

ds4 runs DeepSeek-V4-Flash (284B / 13B-active MoE). Failed Star starts with a tiny dense model — so the comparison is 1:1 for fundamentals and divergent for the fancy parts, which is exactly why those are late milestones.