Gotchas in Tokenizer Behavior Every Developer Should Know
This is an evolving blog post about things you need to know when you're developing with tokenizers. These are lessons I've learned along the way, and I hope they help you avoid the mistakes I've made.
BOS token
1. Not all tokenizers have a BOS token
For example, Qwen/Qwen2.5-0.5B does not have a bos_token:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
>>> tokenizer.bos_token is not None
False
while microsoft/Phi-3-mini-128k-instruct does:
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> tokenizer.bos_token is not None
True
>>> tokenizer.bos_token
'<s>'
2. A tokenizer can have a BOS token but not use it
For example, both microsoft/Phi-3-mini-128k-instruct and CohereLabs/aya-expanse-8b have a bos_token, but only CohereLabs/aya-expanse-8b uses it:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> tokenizer.bos_token, tokenizer.bos_token_id
('<s>', 1)
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
>>> input_ids
[25685, 338, 2253, 1135, 22769]
>>> tokenizer.bos_token_id in input_ids
False
>>> input_ids = tokenizer.apply_chat_template([{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}])
>>> input_ids
[32010, 1724, 338, 2253, 1135, 22769, 29973, 32007, 32001, 25685, 29889, 32007, 32000]
>>> tokenizer.bos_token_id in input_ids
False
>>> tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
>>> tokenizer.bos_token, tokenizer.bos_token_id
('<BOS_TOKEN>', 5)
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
>>> input_ids
[5, 82653, 1801, 5329, 2924, 82092]
>>> tokenizer.bos_token_id in input_ids
True
>>> input_ids = tokenizer.apply_chat_template([{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}])
>>> input_ids
[5, 255000, 255006, 11214, 1801, 5329, 2924, 82092, 38, 255001, 255000, 255007, 82653, 21, 255001]
>>> tokenizer.bos_token_id in input_ids
True
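If you're unsure how a given model behaves, a quick way to check is to tokenize a short probe string and look at the first id. Below is a minimal sketch; the helper name uses_bos is mine, not part of the transformers API:
from transformers import AutoTokenizer

def uses_bos(model_id: str) -> bool:
    """Return True if the tokenizer has a BOS token and actually prepends it."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if tokenizer.bos_token_id is None:
        return False
    input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
    return len(input_ids) > 0 and input_ids[0] == tokenizer.bos_token_id
Based on the outputs above, uses_bos("microsoft/Phi-3-mini-128k-instruct") returns False, while uses_bos("CohereLabs/aya-expanse-8b") returns True.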
EOS token
3. Tokenizing doesn't add the EOS token
When you tokenize a string, it doesn't automatically add the EOS token.
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
>>> tokenizer.eos_token, tokenizer.eos_token_id
('<|endoftext|>', 151643)
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
>>> input_ids
[46518, 374, 2664, 1091, 27261]
>>> input_ids[-1] == tokenizer.eos_token_id
False
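If your setup needs the EOS token, for example so the model learns when to stop generating, one simple approach is to append it yourself after tokenizing. Reusing the Qwen tokenizer from above:
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"] + [tokenizer.eos_token_id]
>>> input_ids[-1] == tokenizer.eos_token_id
True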
4. Applying the chat template may add the EOS token, but not always, and not always at the end
Whether a chat template adds an EOS token, and where it puts it, varies across models:
Some templates append the EOS at the end, like meta-llama/Llama-3.2-1B-Instruct:
>>> from transformers import AutoTokenizer
>>> messages = [
... {"role": "user", "content": "What is better than ugly?"},
... {"role": "assistant", "content": "Beautiful."},
... ]
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
>>> tokenizer.eos_token, tokenizer.eos_token_id
('<|eot_id|>', 128009)
>>> input_ids = tokenizer.apply_chat_template(messages)
>>> input_ids
[128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 972, 5186, 220, 2366, 20, 271, 128009, 128006, 882, 128007, 271, 3923, 374, 2731, 1109, 28360, 30, 128009, 128006, 78191, 128007, 271, 47618, 13, 128009]
>>> input_ids[-1] == tokenizer.eos_token_id
True
Some don’t add an EOS at all, like databricks/dbrx-instruct:
>>> tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct")
>>> tokenizer.eos_token, tokenizer.eos_token_id
('<|endoftext|>', 100257)
>>> input_ids = tokenizer.apply_chat_template(messages)
>>> input_ids
[100278, 9125, 198, 2675, 527, 6078, 46913, 11, 3549, 555, 423, 2143, 78889, 13, 1472, 1051, 1566, 6177, 304, 6790, 220, 2366, 18, 13, 1472, 4320, 4860, 3196, 389, 2038, 2561, 709, 311, 430, 1486, 627, 57489, 15843, 36, 66024, 77273, 50, 5257, 66024, 57828, 43486, 2794, 23233, 29863, 11, 719, 3493, 17879, 14847, 311, 810, 6485, 323, 1825, 84175, 4860, 627, 2675, 7945, 449, 5370, 9256, 11, 505, 4477, 311, 11058, 320, 985, 51594, 369, 2082, 10215, 2001, 6227, 311, 1005, 55375, 449, 2082, 11, 4823, 11, 323, 12920, 4390, 7, 2675, 656, 539, 617, 1972, 7394, 828, 2680, 477, 2082, 11572, 17357, 13, 1472, 5766, 23473, 67247, 323, 3493, 24770, 39555, 389, 20733, 13650, 13, 1472, 656, 539, 3493, 5609, 24142, 11, 45319, 11, 477, 3754, 9908, 323, 656, 539, 82791, 713, 3649, 315, 701, 4967, 828, 29275, 2028, 374, 701, 1887, 10137, 11, 51346, 701, 14847, 13, 3234, 539, 5905, 433, 11, 1120, 6013, 311, 279, 1217, 13, 1442, 499, 1505, 6261, 7556, 922, 420, 1984, 11, 3009, 13, 1472, 1288, 387, 30438, 36001, 323, 6118, 430, 3445, 539, 45391, 420, 627, 57489, 9503, 4276, 386, 72983, 4230, 3083, 10245, 45613, 52912, 21592, 66873, 6781, 38873, 3247, 45613, 3507, 20843, 9109, 393, 3481, 691, 1863, 5257, 3247, 14194, 13575, 68235, 13, 100279, 198, 100278, 882, 198, 3923, 374, 2731, 1109, 28360, 30, 100279, 198, 100278, 78191, 198, 47618, 13, 100279]
>>> input_ids[-1] == tokenizer.eos_token_id
False
Some add the EOS token, but not at the very end, like Qwen/Qwen2.5-0.5B-Instruct:
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
>>> tokenizer.eos_token, tokenizer.eos_token_id
('<|im_end|>', 151645)
>>> input_ids = tokenizer.apply_chat_template(messages)
>>> input_ids
[151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 3838, 374, 2664, 1091, 27261, 30, 151645, 198, 151644, 77091, 198, 46518, 13, 151645, 198]
>>> input_ids[-1] == tokenizer.eos_token_id
False
>>> input_ids[-2] == tokenizer.eos_token_id
True
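Because this depends entirely on the template, it's safer to check than to assume. Here's a small sketch; check_eos is my own helper, not a transformers function:
def check_eos(tokenizer, messages):
    """Return (EOS present, EOS is the final id) for the templated conversation."""
    input_ids = tokenizer.apply_chat_template(messages)
    has_eos = tokenizer.eos_token_id in input_ids
    ends_with_eos = len(input_ids) > 0 and input_ids[-1] == tokenizer.eos_token_id
    return has_eos, ends_with_eos
With the three tokenizers above it returns (True, True), (False, False), and (True, False), respectively.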
PAD token
5. When pad_token equals eos_token
It's common to set the pad_token to the same value as the eos_token, but this requires extra care when masking or preparing labels.
For example:
labels = input_ids.clone()
labels[input_ids == tokenizer.pad_token_id] = -100 # ⚠️ Not safe if PAD == EOS
If pad_token_id == eos_token_id, this will also mask out actual eos_tokens, which are typically meaningful and shouldn't be ignored. When using the same ID for both, make sure your masking logic doesn't unintentionally remove valid eos_tokens.
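One workaround, sketched here under the assumption that you padded with the tokenizer and therefore have an attention_mask tensor, is to build the label mask from the attention mask instead of comparing against pad_token_id, so genuine eos_tokens keep their labels:
labels = input_ids.clone()
labels[attention_mask == 0] = -100  # mask only padded positions; real EOS tokens keep their labels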
Chat template
6. Applying the chat template is not a homomorphism with respect to concatenation
In other words, you can't apply the template separately to the prompt and the completion, then concatenate them—it won't produce the correct result.
This implies you should never apply the chat template to a standalone completion:
completion = tokenizer.apply_chat_template(completion) # ❌ No
And for the prompt, you should either use continue_final_message=True or add_generation_prompt=True:
prompt = tokenizer.apply_chat_template(prompt, continue_final_message=True) # ✅ OK
prompt = tokenizer.apply_chat_template(prompt, add_generation_prompt=True) # ✅ OK
prompt = tokenizer.apply_chat_template(prompt) # ❌ NO
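If you need prompt and completion ids separately, for example to mask the prompt tokens in the loss, one common pattern is to template only the prompt with add_generation_prompt=True and tokenize the completion as plain text, appending the EOS yourself. This is a sketch, not the only option, and depending on the template the result may differ from templating the full conversation by a few formatting tokens:
# Template the prompt only, ending with the generation prompt for the assistant turn.
prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is better than ugly?"}],
    add_generation_prompt=True,
)
# Tokenize the completion without special tokens and append the EOS explicitly.
completion_ids = tokenizer("Beautiful.", add_special_tokens=False)["input_ids"]
input_ids = prompt_ids + completion_ids + [tokenizer.eos_token_id]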
7. Chat template and tokenization don't compose due to special tokens
In other words, you can't apply the chat template and tokenize in sequence due to the special tokens (especially the BOS token).
If you try this:
>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> tokenizer(text) # ❌ NO
It won’t give the expected result because both apply_chat_template and tokenizer add special tokens. Instead, disable special token addition during tokenization:
>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> tokenizer(text, add_special_tokens=False) # ✅ OK
Example with CohereLabs/aya-expanse-8b:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
>>> messages = [
... {"role": "user", "content": "What is better than ugly?"},
... {"role": "assistant", "content": "Beautiful."},
... ]
>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> tokenizer(text)["input_ids"] # ❌ No
[5, 5, 255000, 255006, 11214, 1801, 5329, 2924, 82092, 38, 255001, 255000, 255007, 82653, 21, 255001]
>>> tokenizer(text, add_special_tokens=False)["input_ids"] # ✅ OK
[5, 255000, 255006, 11214, 1801, 5329, 2924, 82092, 38, 255001, 255000, 255007, 82653, 21, 255001]
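As a sanity check for your own pipeline, the text-then-tokenize path with add_special_tokens=False should reproduce what apply_chat_template gives you directly:
>>> tokenizer(text, add_special_tokens=False)["input_ids"] == tokenizer.apply_chat_template(messages)
True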
8. Adding a chat template isn’t enough: you also need to update the EOS token
When fine-tuning a base model and adding a chat template, that template typically includes a special end-of-turn token, for example <|im_end|> in the case of Qwen/Qwen2.5-0.5B-Instruct. This token is used to indicate the end of a message in a chat format.
Here’s a simplified version of a chat template using Jinja syntax:
{%- for message in messages %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- endfor %}
It’s crucial to update the model’s EOS token to match the end-of-turn token used in the template. If they don’t match, it can lead to issues like infinite generation.
tokenizer.chat_template = """\
{%- for message in messages %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- endfor %}
"""
tokenizer.eos_token = "<|im_end|>" # ⚠️ Critical step
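A quick way to double-check the update (a sketch that assumes <|im_end|> already exists in the tokenizer's vocabulary; otherwise add it as a special token first):
# The EOS token should now map to the end-of-turn token used in the chat template.
assert tokenizer.eos_token == "<|im_end|>"
assert tokenizer.eos_token_id == tokenizer.convert_tokens_to_ids("<|im_end|>")
Depending on your setup, the model's generation_config.eos_token_id may need the same update so that generation actually stops at <|im_end|>.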