Gotchas in Tokenizer Behavior Every Developer Should Know

Community Article · Published April 18, 2025

This is an evolving blog post about things you need to know when developing with tokenizers. These are lessons I've learned along the way, and I hope they help you avoid the mistakes I've made.

BOS token

1. Not all tokenizers have a BOS token

For example, Qwen/Qwen2.5-0.5B does not have a bos_token:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
>>> tokenizer.bos_token is not None
False

while microsoft/Phi-3-mini-128k-instruct does:

>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> tokenizer.bos_token is not None
True
>>> tokenizer.bos_token
'<s>'
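
A practical consequence: code that unconditionally prepends tokenizer.bos_token_id will fail on models without one. A minimal sketch of a guard:

# Only prepend a BOS id when the tokenizer actually defines one.
bos = [tokenizer.bos_token_id] if tokenizer.bos_token_id is not None else []
input_ids = bos + tokenizer("Beautiful is better than ugly", add_special_tokens=False)["input_ids"]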

2. A tokenizer can have a BOS token but not use it

For example, both microsoft/Phi-3-mini-128k-instruct and CohereLabs/aya-expanse-8b have a bos_token, but only CohereLabs/aya-expanse-8b actually uses it:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> tokenizer.bos_token, tokenizer.bos_token_id
('<s>', 1)
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
>>> input_ids
[25685, 338, 2253, 1135, 22769]
>>> tokenizer.bos_token_id in input_ids
False

>>> input_ids = tokenizer.apply_chat_template([{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}])
>>> input_ids
[32010, 1724, 338, 2253, 1135, 22769, 29973, 32007, 32001, 25685, 29889, 32007, 32000]
>>> tokenizer.bos_token_id in input_ids
False
>>> tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
>>> tokenizer.bos_token, tokenizer.bos_token_id
('<BOS_TOKEN>', 5)
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
>>> input_ids
[5, 82653, 1801, 5329, 2924, 82092]
>>> tokenizer.bos_token_id in input_ids
True

>>> input_ids = tokenizer.apply_chat_template([{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}])
>>> input_ids
[5, 255000, 255006, 11214, 1801, 5329, 2924, 82092, 38, 255001, 255000, 255007, 82653, 21, 255001]
>>> tokenizer.bos_token_id in input_ids
True
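
To detect whether a tokenizer actually uses its BOS token, you can probe it directly. A minimal sketch (the uses_bos helper is illustrative, not part of transformers):

from transformers import AutoTokenizer

def uses_bos(tokenizer):
    # Does plain tokenization actually prepend the BOS token?
    if tokenizer.bos_token_id is None:
        return False
    ids = tokenizer("test")["input_ids"]
    return len(ids) > 0 and ids[0] == tokenizer.bos_token_id

uses_bos(AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct"))  # False
uses_bos(AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b"))           # True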

EOS token

3. Tokenizing doesn't add the EOS token

When you tokenize a string, it doesn't automatically add the EOS token.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
>>> tokenizer.eos_token, tokenizer.eos_token_id
('<|endoftext|>', 151643)
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
>>> input_ids
[46518, 374, 2664, 1091, 27261]
>>> input_ids[-1] == tokenizer.eos_token_id
False
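
If your pipeline needs every sequence to end with EOS (common when preparing language-modeling data), append it yourself. A minimal sketch:

input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
# Append the EOS token manually, since tokenization alone won't add it.
if tokenizer.eos_token_id is not None and input_ids[-1] != tokenizer.eos_token_id:
    input_ids.append(tokenizer.eos_token_id)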

4. Applying the chat template may add the EOS token, but not always, and not always at the end

Applying a chat template might add an EOS token, but not always, and not always at the end. Behavior varies across models (a small probe is sketched after the list):

  • Some templates append the EOS at the end, like meta-llama/Llama-3.2-1B-Instruct:

    >>> from transformers import AutoTokenizer
    >>> messages = [
    ...     {"role": "user", "content": "What is better than ugly?"},
    ...     {"role": "assistant", "content": "Beautiful."},
    ... ]
    >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
    >>> tokenizer.eos_token, tokenizer.eos_token_id
    ('<|eot_id|>', 128009)
    >>> input_ids = tokenizer.apply_chat_template(messages)
    >>> input_ids
    [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 972, 5186, 220, 2366, 20, 271, 128009, 128006, 882, 128007, 271, 3923, 374, 2731, 1109, 28360, 30, 128009, 128006, 78191, 128007, 271, 47618, 13, 128009]
    >>> input_ids[-1] == tokenizer.eos_token_id
    True
    
  • Some don’t add an EOS at all, like databricks/dbrx-instruct:

    >>> tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct")
    >>> tokenizer.eos_token, tokenizer.eos_token_id
    ('<|endoftext|>', 100257)
    >>> input_ids = tokenizer.apply_chat_template(messages)
    >>> input_ids
    [100278, 9125, 198, 2675, 527, 6078, 46913, 11, 3549, 555, 423, 2143, 78889, 13, 1472, 1051, 1566, 6177, 304, 6790, 220, 2366, 18, 13, 1472, 4320, 4860, 3196, 389, 2038, 2561, 709, 311, 430, 1486, 627, 57489, 15843, 36, 66024, 77273, 50, 5257, 66024, 57828, 43486, 2794, 23233, 29863, 11, 719, 3493, 17879, 14847, 311, 810, 6485, 323, 1825, 84175, 4860, 627, 2675, 7945, 449, 5370, 9256, 11, 505, 4477, 311, 11058, 320, 985, 51594, 369, 2082, 10215, 2001, 6227, 311, 1005, 55375, 449, 2082, 11, 4823, 11, 323, 12920, 4390, 7, 2675, 656, 539, 617, 1972, 7394, 828, 2680, 477, 2082, 11572, 17357, 13, 1472, 5766, 23473, 67247, 323, 3493, 24770, 39555, 389, 20733, 13650, 13, 1472, 656, 539, 3493, 5609, 24142, 11, 45319, 11, 477, 3754, 9908, 323, 656, 539, 82791, 713, 3649, 315, 701, 4967, 828, 29275, 2028, 374, 701, 1887, 10137, 11, 51346, 701, 14847, 13, 3234, 539, 5905, 433, 11, 1120, 6013, 311, 279, 1217, 13, 1442, 499, 1505, 6261, 7556, 922, 420, 1984, 11, 3009, 13, 1472, 1288, 387, 30438, 36001, 323, 6118, 430, 3445, 539, 45391, 420, 627, 57489, 9503, 4276, 386, 72983, 4230, 3083, 10245, 45613, 52912, 21592, 66873, 6781, 38873, 3247, 45613, 3507, 20843, 9109, 393, 3481, 691, 1863, 5257, 3247, 14194, 13575, 68235, 13, 100279, 198, 100278, 882, 198, 3923, 374, 2731, 1109, 28360, 30, 100279, 198, 100278, 78191, 198, 47618, 13, 100279]
    >>> input_ids[-1] == tokenizer.eos_token_id
    False
    
  • Some add the EOS, but not at the very end, like Qwen/Qwen2.5-0.5B-Instruct:

    >>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
    >>> tokenizer.eos_token, tokenizer.eos_token_id
    ('<|im_end|>', 151645)
    >>> input_ids = tokenizer.apply_chat_template(messages)
    >>> input_ids
    [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 3838, 374, 2664, 1091, 27261, 30, 151645, 198, 151644, 77091, 198, 46518, 13, 151645, 198]
    >>> input_ids[-1] == tokenizer.eos_token_id
    False
    >>> input_ids[-2] == tokenizer.eos_token_id
    True
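
Because this varies across models, it's worth probing the tokenizer you actually use before training or evaluating. A minimal sketch (the helper name is made up here):

from transformers import AutoTokenizer

def eos_in_chat_template(model_id):
    # Illustrative probe: where, if anywhere, does the chat template put the EOS token?
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    messages = [
        {"role": "user", "content": "What is better than ugly?"},
        {"role": "assistant", "content": "Beautiful."},
    ]
    input_ids = tokenizer.apply_chat_template(messages)
    if tokenizer.eos_token_id not in input_ids:
        return "no EOS"
    return "EOS at the end" if input_ids[-1] == tokenizer.eos_token_id else "EOS, but not at the end"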
    

PAD token

5. When pad_token equals eos_token

It's common to set the pad_token to the same value as the eos_token, but this requires extra care when masking or preparing labels.

For example:

labels = input_ids.clone()
labels[input_ids == tokenizer.pad_token_id] = -100  # ⚠️ Not safe if PAD == EOS

If pad_token_id == eos_token_id, this will also mask out actual eos_tokens, which are typically meaningful and shouldn't be ignored. When using the same ID for both, make sure your masking logic doesn't unintentionally remove valid eos_tokens.
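
One safer alternative is to mask on position rather than on token ID, using the attention mask returned by the tokenizer. A minimal sketch (assuming PyTorch tensors):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer.pad_token = tokenizer.eos_token  # PAD == EOS

batch = tokenizer(
    ["Beautiful is better than ugly", "Simple is better than complex"],
    padding=True,
    return_tensors="pt",
)
labels = batch["input_ids"].clone()
# Mask where the attention mask is 0 (true padding), so genuine EOS tokens
# that share the PAD id keep their labels.
labels[batch["attention_mask"] == 0] = -100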

Chat template

6. Applying the chat template is not a homomorphism with respect to concatenation

In other words, you can't apply the template separately to the prompt and the completion, then concatenate them—it won't produce the correct result.

This implies you should never apply the chat template to a standalone completion:

completion = tokenizer.apply_chat_template(completion)  # ❌ No

And for the prompt, you should either use continue_final_message=True or add_generation_prompt=True.

prompt = tokenizer.apply_chat_template(prompt, continue_final_message=True)  # ✅ OK
prompt = tokenizer.apply_chat_template(prompt, add_generation_prompt=True)   # ✅ OK
prompt = tokenizer.apply_chat_template(prompt)                               # ❌ NO
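
If you need prompt and completion ids separately (for example to mask the prompt tokens in the loss), one common pattern is to apply the template to the full conversation and to the prompt alone, then slice. A minimal sketch that holds for many, but not all, templates:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
messages = [
    {"role": "user", "content": "What is better than ugly?"},
    {"role": "assistant", "content": "Beautiful."},
]

full_ids = tokenizer.apply_chat_template(messages)
prompt_ids = tokenizer.apply_chat_template(messages[:-1], add_generation_prompt=True)
# Check that the templated prompt really is a prefix of the full conversation;
# not every chat template guarantees this.
assert full_ids[: len(prompt_ids)] == prompt_ids
completion_ids = full_ids[len(prompt_ids):]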

7. Chat template and tokenization don't compose due to special tokens

In other words, you can't apply the chat template and tokenize in sequence due to the special tokens (especially the BOS token).

If you try this:

>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> tokenizer(text)  # ❌ NO

It won’t give the expected result because both apply_chat_template and tokenizer add special tokens. Instead, disable special token addition during tokenization:

>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> tokenizer(text, add_special_tokens=False)  # ✅ OK

Example with CohereLabs/aya-expanse-8b:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
>>> messages = [
...     {"role": "user", "content": "What is better than ugly?"},
...     {"role": "assistant", "content": "Beautiful."},
... ]
>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> tokenizer(text)["input_ids"]  # ❌ No
[5, 5, 255000, 255006, 11214, 1801, 5329, 2924, 82092, 38, 255001, 255000, 255007, 82653, 21, 255001]
>>> tokenizer(text, add_special_tokens=False)["input_ids"]  # ✅ OK
[5, 255000, 255006, 11214, 1801, 5329, 2924, 82092, 38, 255001, 255000, 255007, 82653, 21, 255001]
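
You can confirm the fix by comparing against the ids that apply_chat_template produces when it tokenizes directly:

>>> tokenizer(text, add_special_tokens=False)["input_ids"] == tokenizer.apply_chat_template(messages)
True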

8. Adding a chat template isn’t enough — you also need to update the EOS token

When fine-tuning a base model and adding a chat template, that template typically includes a special end-of-turn token — for example, <|im_end|> in the case of Qwen/Qwen2.5-0.5B-Instruct. This token is used to indicate the end of a message in a chat format.

Here’s a simplified version of a chat template using Jinja syntax:

{%- for message in messages %}
    {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- endfor %}

It’s crucial to update the model’s EOS token to match the end-of-turn token used in the template. If they don’t match, it can lead to issues like infinite generation.

tokenizer.chat_template = """\
{%- for message in messages %}
    {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- endfor %}
"""
tokenizer.eos_token = "<|im_end|>"  # ⚠️ Critical step
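
A quick sanity check after the change: the rendered template should now end with the tokenizer's EOS token.

messages = [
    {"role": "user", "content": "What is better than ugly?"},
    {"role": "assistant", "content": "Beautiful."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
assert text.rstrip().endswith(tokenizer.eos_token)  # ends with "<|im_end|>"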
