Milestone M0 · the model’s front door
The first thing every prompt hits and the last thing every reply passes through. M0 turns a string into the integer IDs the model consumes — and back — using Qwen3-0.6B’s own byte-level BPE vocabulary, so our IDs match the official tokenizer exactly. No GPU, no weights; pure Rust.
tokenizer.json.
$ fs tokenize "hello world" → 14990 1879
$ fs detokenize 14990 1879 → hello world
For why BPE works this way — the merges are learned once by the model creators, and we only ever replay them — read learning 03 · byte-pair encoding first. This page is what we built.
Encoding, stage by stage
The key structural fact: a coarse split into words,
and within each word, a fine split into bytes that BPE
merges back up. pretokenize cuts the sentence into words;
bpe runs on one word at a time, and merges never cross a word
boundary. (That nesting is the thing most people get backwards.)
(1) Pre-tokenize. A fixed regex (read from Qwen’s
tokenizer.json) splits the text. A leading space stays attached:
"hello world" → ["hello", " world"].
(2) Byte-level map. Each raw byte is remapped to a printable
Unicode char so a token never holds a literal space/newline. 0x20 →
Ġ. The map is a bijection over all 256 bytes.
(3) BPE merge. Explode the chunk to single chars, then repeatedly glue the adjacent pair with the lowest merge rank until none apply. The survivors are the tokens.
(4) Vocab lookup. Each surviving piece is a vocab key → its integer ID. Concatenate across all chunks for the final sequence.
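The byte map is small enough to sketch in full. The version below assumes the table is the GPT-2 style one (printable Latin-1 bytes keep their own code point, every other byte is shifted to U+0100 and up), which is consistent with 0x20 → Ġ above. The function names are illustrative and not necessarily the ones used in this repo.

```rust
use std::collections::HashMap;

/// Build the byte -> char table, GPT-2 style (an assumption, see the lead-in).
/// Bytes that are already printable map to themselves; the remaining 68 bytes
/// are assigned U+0100, U+0101, ... in increasing byte order.
fn byte_to_char_table() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut next_spare = 256u32; // next code point handed to a non-printable byte
    for b in 0u32..256 {
        let keeps_own_code_point = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let code_point = if keeps_own_code_point {
            b
        } else {
            let cp = next_spare;
            next_spare += 1;
            cp
        };
        table[b as usize] = char::from_u32(code_point).expect("valid scalar value");
    }
    table
}

/// Forward direction: raw UTF-8 bytes of a chunk -> byte-level string.
fn to_byte_level(chunk: &str) -> String {
    let table = byte_to_char_table();
    chunk.bytes().map(|b| table[b as usize]).collect()
}

/// Reverse direction, as used when decoding: byte-level string -> raw bytes.
/// Returns None if a char is not one of the 256 table entries.
fn from_byte_level(piece: &str) -> Option<Vec<u8>> {
    let inverse: HashMap<char, u8> = byte_to_char_table()
        .iter()
        .enumerate()
        .map(|(b, &c)| (c, b as u8))
        .collect();
    piece.chars().map(|c| inverse.get(&c).copied()).collect()
}
```

With this table " world" becomes Ġworld and a newline becomes Ċ, while plain hello is unchanged. A real implementation would build the table once and reuse it rather than rebuilding it per call.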
Before stage 1, any special-token literal
(<|im_start|> …) is carved out and emitted as its id directly,
bypassing BPE — see split_on_special_tokens.
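As a rough illustration of the carving step, here is one way such a split could look. Only the name split_on_special_tokens and the id 151644 come from this page; the signature, the Segment type, and the tie-breaking rule are assumptions made for the sketch.

```rust
/// A piece of the input after carving: ordinary text that still has to go
/// through stages 1 to 4, or a special token that is already an id.
#[derive(Debug, PartialEq)]
enum Segment<'a> {
    Text(&'a str),
    Special(u32),
}

/// Hypothetical sketch: cut `text` at every occurrence of a special literal.
/// `specials` pairs each literal with its id.
fn split_on_special_tokens<'a>(text: &'a str, specials: &[(&str, u32)]) -> Vec<Segment<'a>> {
    let mut segments = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Earliest occurrence of any literal; on a tie the longer literal wins.
        let mut best: Option<(usize, &str, u32)> = None;
        for &(literal, id) in specials {
            if literal.is_empty() {
                continue;
            }
            if let Some(pos) = rest.find(literal) {
                let better = match best {
                    None => true,
                    Some((best_pos, best_lit, _)) => {
                        pos < best_pos || (pos == best_pos && literal.len() > best_lit.len())
                    }
                };
                if better {
                    best = Some((pos, literal, id));
                }
            }
        }
        match best {
            Some((pos, literal, id)) => {
                if pos > 0 {
                    segments.push(Segment::Text(&rest[..pos]));
                }
                segments.push(Segment::Special(id));
                rest = &rest[pos + literal.len()..];
            }
            None => {
                // No special literal left: the remainder is ordinary text.
                segments.push(Segment::Text(rest));
                break;
            }
        }
    }
    segments
}
```

Each Text segment would then be pre-tokenized and merged as usual, while each Special segment contributes its id unchanged.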
Decoding reverses 4→2: ID → piece → concatenate → undo the byte map → raw UTF-8 (a special id decodes straight to its literal text). No regex, no merging — the merges are already baked into the pieces.
The canonical example
One chunk, all printable ASCII, so its byte-level form is just
hello. Each pass applies the lowest-rank merge present
(#N = the real rank from Qwen’s merge list, model.merges
in tokenizer.json):
| pass | symbols | candidate pairs (rank) | winner |
|---|---|---|---|
| 1 | h e l l o | (h,e)=127, (e,l)=45, (l,l)=398, (l,o)=129 | (e,l) |
| 2 | h el l o | (h,el)=48866, (el,l)=357, (l,o)=129 | (l,o) |
| 3 | h el lo | (h,el)=48866, (el,lo)=4535 | (el,lo) |
| 4 | h ello | (h,ello)=14734 | (h,ello) |
| 5 | hello | none | — stop |
(1) The first merge lands in the middle (e+l),
not at the front — even though (h,e) was a valid merge sitting right
there. Rank 45 beats rank 127, so left-to-right greedy would be wrong; we must pick
the global-minimum rank.
(2) (h,e) then never fires: once e
is absorbed into el, h waits and merges with the whole
ello. A high-priority pair can be permanently starved — and
that is BPE behaving correctly. (Pinned as the test
bpe_reproduces_the_hello_trace.)
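The trace can be replayed in a few lines. This is a sketch and not the crate's own bpe: the rank table holds only the eight pairs listed in the table above (a real one would be built from model.merges), and bpe_trace and merge_pair are hypothetical names.

```rust
/// Ranks copied from the trace table; any pair not listed is unmergeable here.
fn rank(left: &str, right: &str) -> Option<u32> {
    match (left, right) {
        ("e", "l") => Some(45),
        ("h", "e") => Some(127),
        ("l", "o") => Some(129),
        ("el", "l") => Some(357),
        ("l", "l") => Some(398),
        ("el", "lo") => Some(4535),
        ("h", "ello") => Some(14734),
        ("h", "el") => Some(48866),
        _ => None,
    }
}

/// Replace every non-overlapping occurrence of (left, right), scanning left to
/// right. A hit consumes both symbols, so neither half is reused.
fn merge_pair(symbols: &[String], left: &str, right: &str) -> Vec<String> {
    let mut out = Vec::with_capacity(symbols.len());
    let mut i = 0;
    while i < symbols.len() {
        if i + 1 < symbols.len() && symbols[i] == left && symbols[i + 1] == right {
            out.push(format!("{}{}", left, right));
            i += 2;
        } else {
            out.push(symbols[i].clone());
            i += 1;
        }
    }
    out
}

/// Run BPE on one chunk and record the symbol list after every pass.
fn bpe_trace(word: &str) -> Vec<Vec<String>> {
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    let mut passes = vec![symbols.clone()];
    loop {
        // Global minimum rank over all adjacent pairs (leftmost on a tie).
        let best = symbols
            .windows(2)
            .filter_map(|w| rank(&w[0], &w[1]).map(|r| (r, w[0].clone(), w[1].clone())))
            .min_by_key(|(r, _, _)| *r);
        match best {
            Some((_, left, right)) => {
                symbols = merge_pair(&symbols, &left, &right);
                passes.push(symbols.clone());
            }
            None => break, // no mergeable pair is left
        }
    }
    passes
}
```

Calling bpe_trace("hello") yields the five symbol lists from the table, ending in the single symbol hello. The first merge picked is (e,l) because the search takes the minimum over all adjacent pairs rather than the first mergeable pair from the left.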
Where it gets subtle
"hello world" → [14990, 1879] but
" hello world" → [23811, 1879] — a different first
token. If the byte map or regex is wrong, this breaks first.
Qwen’s pattern uses \p{N}, not \p{N}+, so
"123" → ["1","2","3"]. We read the pattern straight
from tokenizer.json rather than retype GPT-4’s {1,3}.
model.merges is already in priority order, so rank = array index
(first pair = rank 0). A wrong base would silently corrupt every
tokenization. (The old merges.txt had a header to skip — an
off-by-one trap the array sidesteps.)
Merging (a,a) over [a,a,a] yields [aa,a],
not a reused middle a — we consume both halves on a hit.
Lookups return typed errors,
never panics: a piece outside the vocab → UnknownToken
(theoretically impossible, but surfaced not crashed); an out-of-range id in
decode → InvalidTokenId.
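The typed-error lookups can be sketched as follows. The variant names UnknownToken and InvalidTokenId are the ones mentioned above; the payloads, the Vocab type, and its method names are assumptions, and the toy vocab uses made-up contiguous ids rather than Qwen's.

```rust
use std::collections::HashMap;
use std::fmt;

#[derive(Debug, PartialEq)]
enum TokenizerError {
    /// A piece produced by BPE has no entry in the vocab.
    UnknownToken(String),
    /// An id passed to decode has no piece.
    InvalidTokenId(u32),
}

impl fmt::Display for TokenizerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TokenizerError::UnknownToken(piece) => {
                write!(f, "piece {:?} is not in the vocab", piece)
            }
            TokenizerError::InvalidTokenId(id) => write!(f, "token id {} is out of range", id),
        }
    }
}

impl std::error::Error for TokenizerError {}

/// Hypothetical two-way vocab; ids are simply positions in `id_to_token`.
struct Vocab {
    token_to_id: HashMap<String, u32>,
    id_to_token: Vec<String>,
}

impl Vocab {
    fn from_pieces(pieces: &[&str]) -> Self {
        let id_to_token: Vec<String> = pieces.iter().map(|p| p.to_string()).collect();
        let token_to_id = id_to_token
            .iter()
            .enumerate()
            .map(|(i, p)| (p.clone(), i as u32))
            .collect();
        Vocab { token_to_id, id_to_token }
    }

    /// Encode side: piece -> id, or UnknownToken instead of a panic.
    fn id_of(&self, piece: &str) -> Result<u32, TokenizerError> {
        self.token_to_id
            .get(piece)
            .copied()
            .ok_or_else(|| TokenizerError::UnknownToken(piece.to_string()))
    }

    /// Decode side: id -> piece, or InvalidTokenId instead of a panic.
    fn piece_of(&self, id: u32) -> Result<&str, TokenizerError> {
        self.id_to_token
            .get(id as usize)
            .map(String::as_str)
            .ok_or(TokenizerError::InvalidTokenId(id))
    }
}
```

Returning Result from both lookups lets callers propagate a failure with `?`, so a bad id from user input surfaces as an error value rather than an index-out-of-bounds crash.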
The milestone’s “done” gate
The fixture tests/golden/tokenizer.json is generated by the
official Hugging Face tokenizer (a one-shot oracle), then committed.
tests/golden_tokenizer.rs asserts, for all 14 cases:
encode(text) equals the official IDs,
decode(official_ids) equals the official text,
decode(encode(text)) equals text.
Exact-ID parity across a diverse set (ASCII, CJK, emoji, code with tabs/newlines,
digit runs) is the definition of “works with this model”: it proves the
byte map, the regex, and the merge order are all correct at once. A second
integration test covers special tokens (<|im_start|> →
151644, carving, round-trip). Both skip gracefully if the (git-ignored)
model assets aren’t fetched, so a fresh checkout stays green.
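The shape of that gate can be sketched without the real fixture. Everything named here is hypothetical: GoldenCase and its fields are a guess at the fixture layout, and check_golden takes the encoder and decoder as closures so the sketch does not depend on the crate's actual API.

```rust
use std::path::Path;

/// One golden case: input text plus the ids the reference tokenizer produced.
/// The field names are an assumption about the fixture layout.
struct GoldenCase {
    text: String,
    ids: Vec<u32>,
}

/// Run the three parity checks on every case; report the first failure.
fn check_golden<E, D>(cases: &[GoldenCase], encode: E, decode: D) -> Result<(), String>
where
    E: Fn(&str) -> Vec<u32>,
    D: Fn(&[u32]) -> String,
{
    for case in cases {
        let ids = encode(&case.text);
        // 1. Our ids must equal the official ids exactly.
        if ids != case.ids {
            return Err(format!(
                "encode({:?}) = {:?}, expected {:?}",
                case.text, ids, case.ids
            ));
        }
        // 2. Decoding the official ids must give back the official text.
        if decode(&case.ids) != case.text {
            return Err(format!("decode(official ids) differs for {:?}", case.text));
        }
        // 3. Encoding and then decoding must be the identity on the text.
        if decode(&ids) != case.text {
            return Err(format!("round trip differs for {:?}", case.text));
        }
    }
    Ok(())
}

/// True if the asset exists on disk; a test can return early when it does not.
fn assets_present(path: &str) -> bool {
    Path::new(path).exists()
}
```

In an integration test, the body would start by checking assets_present on the model file and returning early with a printed note when it is missing, which is one way to keep a fresh checkout green.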
Special tokens are read from tokenizer.json’s added_tokens, carved out in
encode, decoded back in decode. What’s deferred:
the chat template (wrapping turns in
<|im_start|>/<|im_end|>) is an M3 concern; and
bpe’s O(n²)-per-word loop stays as it is for now, with HashMap memoization
of bpe(word) recorded as a later optimization.
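For a sense of what that deferred memoization might look like, here is a minimal sketch. BpeCache is a hypothetical wrapper, and the merge routine is passed in as a closure so the example stays independent of the real bpe signature.

```rust
use std::collections::HashMap;

/// Hypothetical memoizing wrapper around a per-word BPE function.
struct BpeCache<F> {
    bpe: F,
    memo: HashMap<String, Vec<String>>,
    /// Number of times the wrapped function actually ran.
    misses: usize,
}

impl<F: Fn(&str) -> Vec<String>> BpeCache<F> {
    fn new(bpe: F) -> Self {
        BpeCache { bpe, memo: HashMap::new(), misses: 0 }
    }

    /// Return the pieces for `word`, running the merge loop only on a miss.
    fn get(&mut self, word: &str) -> &[String] {
        if !self.memo.contains_key(word) {
            let pieces = (self.bpe)(word);
            self.misses += 1;
            self.memo.insert(word.to_string(), pieces);
        }
        &self.memo[word]
    }
}
```

Because merges never cross a chunk boundary, the result for a chunk depends only on the chunk itself, which is what makes a per-word cache safe. This sketch never evicts entries, so a long-running process would want a size bound.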
See also the live
toy-BPE diagram,
learning 03 · BPE, and the full writeup
docs/01-tokenizer.md.