M0 — Tokenizer — Failed Star

Milestone M0 · the model’s front door

Tokenizer: text ↔ token IDs

The first thing every prompt hits and the last thing every reply passes through. M0 turns a string into the integer IDs the model consumes — and back — using Qwen3-0.6B’s own byte-level BPE vocabulary, so our IDs match the official tokenizer exactly. No GPU, no weights; pure Rust.

Status: ✅ done. All 14 golden cases reproduce the official Qwen IDs, decode back, and round-trip; special tokens are supported; 17 unit tests + 2 integration tests, no warnings. Everything loads from the single tokenizer.json.

$ fs tokenize "hello world"     →  14990 1879
$ fs detokenize 14990 1879      →  hello world

For why BPE works this way — the merges are learned once by the model creators, and we only ever replay them — read learning 03 · byte-pair encoding first. This page is what we built.

Encoding, stage by stage

Four stages — and they’re nested

The key structural fact: a coarse split into words, and within each word, a fine split into bytes that BPE merges back up. pretokenize cuts the sentence into words; bpe runs on one word at a time, and merges never cross a word boundary. (That nesting is the thing most people get backwards.)

1. pretokenize

Text → word chunks

A fixed regex (read from Qwen’s tokenizer.json) splits the text. A leading space stays attached: "hello world" → ["hello", " world"].
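To make the chunking concrete, here is a minimal hand-rolled stand-in. It is not Qwen's pattern (the real one is read from tokenizer.json and needs a regex engine); it only mimics two of its behaviours, a single leading space gluing onto the following word and every digit standing alone. The name pretokenize_sketch and its rules are assumptions made for illustration.

```rust
/// Simplified stand-in for the pretokenizer regex (NOT the real Qwen pattern).
/// Rules: a run of letters, optionally preceded by one space, is one chunk;
/// any other character (digit, punctuation, lone whitespace) is its own chunk.
fn pretokenize_sketch(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut chunks: Vec<String> = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let start = i;
        // A single space attaches to the letters that follow it.
        if chars[i] == ' ' && i + 1 < chars.len() && chars[i + 1].is_alphabetic() {
            i += 1;
        }
        if chars[i].is_alphabetic() {
            // Consume the whole run of letters.
            while i < chars.len() && chars[i].is_alphabetic() {
                i += 1;
            }
        } else {
            // Everything else is emitted one character at a time.
            i += 1;
        }
        chunks.push(chars[start..i].iter().collect::<String>());
    }
    chunks
}
```

Calling pretokenize_sketch("hello world") gives ["hello", " world"], and "123" comes out as three one-digit chunks.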

2. build_byte_encoder

Bytes → printable chars

Each raw byte is remapped to a printable Unicode char so a token never holds a literal space/newline. 0x20 → Ġ. The map is a bijection over all 256 bytes.
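A sketch of how such a table can be built, assuming the standard GPT-2 byte-to-unicode scheme (which is consistent with 0x20 → Ġ): bytes that are already printable map to themselves, and the rest are shifted to code points from 256 upward in byte order. The array return type and the build_byte_decoder helper are assumptions, not necessarily the signatures used in the repo.

```rust
use std::collections::HashMap;

/// Byte -> printable char, following the GPT-2 scheme (assumed here).
fn build_byte_encoder() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut shifted = 0u32; // how many non-printable bytes we have remapped so far
    for b in 0u32..256 {
        let printable = matches!(b, 33..=126 | 161..=172 | 174..=255);
        let code = if printable {
            b // already a visible character: keep it
        } else {
            let code = 256 + shifted; // park it above the byte range
            shifted += 1;
            code
        };
        table[b as usize] = char::from_u32(code).unwrap();
    }
    table
}

/// Inverse map, used when decoding (hypothetical helper name).
fn build_byte_decoder(encoder: &[char; 256]) -> HashMap<char, u8> {
    encoder.iter().enumerate().map(|(b, &c)| (c, b as u8)).collect()
}
```

Because every byte gets a distinct char, the decoder ends up with exactly 256 entries, which is a cheap way to assert the bijection in a unit test.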

3. bpe

Merge by rank

Explode the chunk to single chars, then repeatedly glue the adjacent pair with the lowest merge rank until none apply. The survivors are the tokens.

4. token_to_id

Pieces → IDs

Each surviving piece is a vocab key → its integer ID. Concatenate across all chunks for the final sequence.

Before stage 1, any special-token literal (<|im_start|> …) is carved out and emitted as its id directly, bypassing BPE — see split_on_special_tokens.
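The carving step might look like the sketch below. Only the function name and the <|im_start|> → 151644 mapping come from this page; the Segment enum, the signature, and the earliest-match, longest-first tie-break are assumptions for illustration.

```rust
use std::cmp::Reverse;

/// One piece of the input: plain text still to be tokenized, or a special id.
#[derive(Debug, PartialEq)]
enum Segment<'a> {
    Text(&'a str),
    Special(u32),
}

/// Split `text` around special-token literals.
/// `specials` holds (literal, id) pairs; literals are assumed non-empty.
fn split_on_special_tokens<'a>(text: &'a str, specials: &[(&str, u32)]) -> Vec<Segment<'a>> {
    let mut out = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Earliest occurrence wins; on a tie, prefer the longest literal.
        let hit = specials
            .iter()
            .filter_map(|&(lit, id)| rest.find(lit).map(|pos| (pos, lit.len(), id)))
            .min_by_key(|&(pos, len, _)| (pos, Reverse(len)));
        match hit {
            Some((pos, len, id)) => {
                if pos > 0 {
                    out.push(Segment::Text(&rest[..pos]));
                }
                out.push(Segment::Special(id)); // emitted directly, no BPE
                rest = &rest[pos + len..];
            }
            None => {
                out.push(Segment::Text(rest));
                break;
            }
        }
    }
    out
}
```

Only the Text segments would then go through stages 1 to 4; Special segments contribute their id as is.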

Decoding reverses 4→2: ID → piece → concatenate → undo the byte map → raw UTF-8 (a special id decodes straight to its literal text). No regex, no merging — the merges are already baked into the pieces.
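A minimal sketch of that decode path, assuming the GPT-2-style byte map and a plain id → piece HashMap. It returns Option for brevity where the real code would use typed errors, and the function names and signatures are illustrative only. Special tokens are left out.

```rust
use std::collections::HashMap;

/// Printable char -> raw byte (inverse of the GPT-2-style byte map, assumed).
fn byte_decoder() -> HashMap<char, u8> {
    let mut table = HashMap::new();
    let mut shifted = 0u32;
    for b in 0u32..256 {
        let printable = matches!(b, 33..=126 | 161..=172 | 174..=255);
        let code = if printable {
            b
        } else {
            let code = 256 + shifted;
            shifted += 1;
            code
        };
        table.insert(char::from_u32(code).unwrap(), b as u8);
    }
    table
}

/// ID -> piece -> concatenate -> undo the byte map -> UTF-8.
fn decode(ids: &[u32], id_to_piece: &HashMap<u32, &str>) -> Option<String> {
    let table = byte_decoder();
    let mut bytes = Vec::new();
    for id in ids {
        let piece = id_to_piece.get(id)?; // unknown id: give up
        for c in piece.chars() {
            bytes.push(*table.get(&c)?);
        }
    }
    // Validate UTF-8 once, over the whole byte sequence.
    String::from_utf8(bytes).ok()
}
```

The bytes are collected across all pieces before the UTF-8 check because a multi-byte character can be split between two tokens; decoding piece by piece would reject valid text.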

The canonical example

“hello” → 14990

One chunk, all printable ASCII, so its byte-level form is just hello. Each pass applies the lowest-rank merge present (#N = the real rank from Qwen’s merge list, model.merges in tokenizer.json):

pass	symbols	candidate pairs (rank)	winner
1	`h e l l o`	(h,e)=127, (e,l)=45, (l,l)=398, (l,o)=129	(e,l)
2	`h el l o`	(h,el)=48866, (el,l)=357, (l,o)=129	(l,o)
3	`h el lo`	(h,el)=48866, (el,lo)=4535	(el,lo)
4	`h ello`	(h,ello)=14734	(h,ello)
5	`hello`	none	— stop

Two lessons a toy trace hides. (1) The first merge is in the middle (e+l), not at the front — even though (h,e) was a valid merge sitting right there. Rank 45 beats rank 127, so left-to-right greedy would be wrong; we must pick the global-minimum rank. (2) (h,e) then never fires: once e is absorbed into el, h waits and merges with the whole ello. A high-priority pair can be permanently starved — and that is BPE behaving correctly. (Pinned as the test bpe_reproduces_the_hello_trace.)
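The trace can be replayed with a few lines of Rust. This is a sketch, not the repo's bpe: it uses only the eight ranks quoted in the table, and the names bpe_trace, hello_ranks, and Ranks are made up for the example. It records the symbol list after every merge so the output can be compared with the table row by row.

```rust
use std::collections::HashMap;

type Ranks = HashMap<(String, String), usize>;

/// Only the ranks quoted in the table; the real merge list is far longer.
fn hello_ranks() -> Ranks {
    [
        ("h", "e", 127), ("e", "l", 45), ("l", "l", 398), ("l", "o", 129),
        ("h", "el", 48866), ("el", "l", 357), ("el", "lo", 4535), ("h", "ello", 14734),
    ]
    .iter()
    .map(|&(a, b, rank)| ((a.to_string(), b.to_string()), rank))
    .collect()
}

/// Run BPE on one chunk and return a snapshot of the symbols at every pass.
fn bpe_trace(word: &str, ranks: &Ranks) -> Vec<String> {
    let mut symbols: Vec<String> = word.chars().map(String::from).collect();
    let mut snapshots = vec![symbols.join(" ")];
    loop {
        // Global minimum rank over ALL adjacent pairs, not the leftmost hit.
        let best = symbols
            .windows(2)
            .filter_map(|w| {
                let pair = (w[0].clone(), w[1].clone());
                ranks.get(&pair).map(|&rank| (rank, pair))
            })
            .min_by_key(|(rank, _)| *rank);
        let Some((_, (left, right))) = best else {
            return snapshots; // no mergeable pair left
        };
        // Apply the winner left to right, consuming both halves on a hit.
        let mut merged = Vec::with_capacity(symbols.len());
        let mut i = 0;
        while i < symbols.len() {
            if i + 1 < symbols.len() && symbols[i] == left && symbols[i + 1] == right {
                merged.push(format!("{left}{right}"));
                i += 2;
            } else {
                merged.push(symbols[i].clone());
                i += 1;
            }
        }
        symbols = merged;
        snapshots.push(symbols.join(" "));
    }
}
```

bpe_trace("hello", &hello_ranks()) yields the five symbol rows of the table in order, ending in the single piece hello.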

Where it gets subtle

Gotchas we hit

Leading spaces are load-bearing

"hello world" → [14990, 1879] but " hello world" → [23811, 1879] — a different first token. If the byte map or regex is wrong, this breaks first.

Each digit is its own chunk

Qwen’s pattern matches digits with \p{N}, not \p{N}+, so every digit is its own chunk: "123" → ["1","2","3"]. We read the pattern straight from tokenizer.json rather than retyping GPT-4’s \p{N}{1,3}, which groups up to three digits.

Merge order is everything

model.merges is already in priority order, so rank = array index (first pair = rank 0). Only the relative order matters to BPE, but dropping or misaligning a single entry would silently change which merges fire. (The old merges.txt had a header line to skip, an off-by-one trap the array sidesteps.)
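In code, "rank = array index" is a single enumerate. The sketch below assumes the merges have already been parsed into (left, right) pairs; how tokenizer.json stores them on disk is not covered here, and build_ranks is a hypothetical name.

```rust
use std::collections::HashMap;

/// Turn the ordered merge list into a (left, right) -> rank lookup.
/// The first entry gets rank 0; no header line to skip.
fn build_ranks(merges: &[(&str, &str)]) -> HashMap<(String, String), usize> {
    merges
        .iter()
        .enumerate()
        .map(|(rank, &(left, right))| ((left.to_string(), right.to_string()), rank))
        .collect()
}
```

With a toy list such as [("x", "y"), ("xy", "z")], the pair (x, y) gets rank 0 and (xy, z) gets rank 1.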

Merge non-overlapping

Merging (a,a) over [a,a,a] yields [aa,a], not a reused middle a — we consume both halves on a hit.
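A minimal version of that merge step, with a hypothetical merge_pair helper: the index jumps by two on a hit, so the second half of a merged pair can never start another match.

```rust
/// Replace every non-overlapping (left, right) occurrence, scanning left to right.
fn merge_pair(symbols: &[&str], left: &str, right: &str) -> Vec<String> {
    let mut out = Vec::with_capacity(symbols.len());
    let mut i = 0;
    while i < symbols.len() {
        if i + 1 < symbols.len() && symbols[i] == left && symbols[i + 1] == right {
            out.push(format!("{left}{right}"));
            i += 2; // consume both halves
        } else {
            out.push(symbols[i].to_string());
            i += 1;
        }
    }
    out
}
```

merge_pair(&["a", "a", "a"], "a", "a") returns ["aa", "a"]; with four a's it returns ["aa", "aa"].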

Lookups return typed errors, never panics: a piece outside the vocab → UnknownToken (theoretically impossible, but surfaced rather than crashed); an out-of-range id in decode → InvalidTokenId.
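One way to shape those errors is sketched below. The variant names UnknownToken and InvalidTokenId and the function name token_to_id are from this page; the TokenizerError enum name, the payloads, the id_to_token helper, and the signatures are assumptions.

```rust
use std::collections::HashMap;
use std::fmt;

#[derive(Debug, PartialEq)]
enum TokenizerError {
    UnknownToken(String), // a piece with no vocab entry
    InvalidTokenId(u32),  // an id past the end of the vocab
}

impl fmt::Display for TokenizerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TokenizerError::UnknownToken(piece) => write!(f, "unknown token {piece:?}"),
            TokenizerError::InvalidTokenId(id) => write!(f, "invalid token id {id}"),
        }
    }
}

impl std::error::Error for TokenizerError {}

/// Encode side: piece -> id, or a typed error instead of a panic.
fn token_to_id(vocab: &HashMap<String, u32>, piece: &str) -> Result<u32, TokenizerError> {
    vocab
        .get(piece)
        .copied()
        .ok_or_else(|| TokenizerError::UnknownToken(piece.to_string()))
}

/// Decode side: id -> piece, bounds-checked.
fn id_to_token(pieces: &[String], id: u32) -> Result<&str, TokenizerError> {
    pieces
        .get(id as usize)
        .map(|s| s.as_str())
        .ok_or(TokenizerError::InvalidTokenId(id))
}
```

Callers can then propagate failures with ? and match on the variant, rather than unwrapping a lookup.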

The milestone’s “done” gate

How we know it’s right

The fixture tests/golden/tokenizer.json is generated by the official Hugging Face tokenizer (a one-shot oracle), then committed. tests/golden_tokenizer.rs asserts, for all 14 cases:

encode(text) equals the official IDs,
decode(official_ids) equals the official text,
round-trip decode(encode(text)) equals text.

Exact-ID parity across a diverse set — ASCII, CJK, emoji, code with tabs/newlines, digit runs — is the definition of “works with this model”: it proves the byte map, the regex, and the merge order are all correct at once. A second integration test covers special tokens (<|im_start|> → 151644, carving, round-trip). Both skip gracefully if the (git-ignored) model assets aren’t fetched, so a fresh checkout stays green.

Special tokens are supported (no longer stubbed): they are loaded from tokenizer.json’s added_tokens, carved out in encode, and restored in decode. Two things are deferred: the chat template (wrapping turns in <|im_start|>/<|im_end|>) is an M3 concern, and bpe’s O(n²)-per-word loop stays as is for now, with HashMap memoization of bpe(word) recorded as a later optimization.

📖 Inference Engineering (Kiely) · §2.2 (p.46)
🔧 ds4 · ds4.c: bpe_tokenize_text, gpt2_byte_to_codepoint, str_i32_table
🧭 Raschka · Build an LLM: BPE chapter

See also the live toy-BPE diagram, learning 03 · BPE, and the full writeup docs/01-tokenizer.md.