Is the EOS token in tokenizer.json and tokenizer_config.json correct?
For context see: https://github.com/ggml-org/llama.cpp/issues/14044#issuecomment-2952938086 where I noticed this.
Are the token names correctly set in tokenizer.json and tokenizer_config.json?
In llama.cpp I saw this from llama-tokenize (a tool that tokenizes a prompt so you can see the numerical token IDs):
$ llama-tokenize --model dots1-instruct-q4.gguf --prompt "<|userprompt|><|endofuserprompt|><|response|><|endofresponse|><|im_end|><|im_start|>" --log-disable
151646 -> '<|userprompt|>'
151647 -> '<|endofuserprompt|>'
151648 -> '<|response|>'
151649 -> '<|endofresponse|>'
151645 -> '<|im_end|>'
151644 -> '<|im_start|>'
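For comparison, the same mapping can be cross-checked from the Hugging Face side with the fast tokenizer (which is built from tokenizer.json, the same file the GGUF conversion is based on as far as I understand). A minimal sketch, assuming the repo loads with plain AutoTokenizer (I haven't checked whether trust_remote_code is needed):

```python
# Sketch: print the IDs that the fast tokenizer (driven by tokenizer.json)
# assigns to the special tokens, to compare against the llama-tokenize output.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("rednote-hilab/dots.llm1.inst")

for name in ("<|userprompt|>", "<|endofuserprompt|>", "<|response|>",
             "<|endofresponse|>", "<|im_end|>", "<|im_start|>", "<|system|>"):
    print(tok.convert_tokens_to_ids(name), "->", repr(name))
```

If this prints the same IDs as llama-tokenize, the discrepancy is purely between tokenizer.json and the added_tokens_decoder section of tokenizer_config.json.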
But when I compare against tokenizer_config.json (https://huggingface.co/rednote-hilab/dots.llm1.inst/blob/main/tokenizer_config.json), I get:
"151646" -> I can't find this in tokenizer_config.json
"151647" -> "<|userprompt|>" (this is <|endofuserprompt|> in llama.cpp side)
"151648" -> "<|endofuserprompt|>" (this is <|response|> in llama.cpp side)
"151649" -> "<|response|>" (this is <|endofresponse|> in llama.cpp side)
"151645" -> "<|im_end|>"
"151644" -> "<|im_start|>"
There might be more tokens off; these are just what I saw on an initial check. I was hacking together llama.cpp support, and the WIP .gguf converter code, going by defaults, picked up 151650 as the stopping token (which in this tokenizer_config.json is <|endofresponse|>), but llama.cpp has 151650 -> '<|system|>' instead. That made llama.cpp not realize when to stop generating.
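For reference, the EOS ID that ended up in my WIP conversion can be read back from the GGUF metadata. A sketch using the gguf Python package (the scalar-value access mirrors what gguf's dump script does, so treat that detail as an assumption about the reader's internals):

```python
# Sketch: check which token ID the GGUF metadata declares as EOS.
from gguf import GGUFReader

reader = GGUFReader("dots1-instruct-q4.gguf")  # my local WIP conversion from above
field = reader.fields["tokenizer.ggml.eos_token_id"]
# For scalar key/value fields the value sits in the last part
print("tokenizer.ggml.eos_token_id =", field.parts[-1][0])
```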
Empirically I saw that the model wanted to output 151649, which is why I think that ID is maybe meant to be <|endofresponse|> instead of <|response|>.
The base model files might be correct? 151649 is <|endofresponse|> on the base model side here: https://huggingface.co/rednote-hilab/dots.llm1.base/blob/main/tokenizer_config.json
Edit: one more addition: I also see 151645 as eos_token_id in config.json and generation_config.json, which is <|im_end|>. That's a ChatML-style ending tag, if my memory is correct, but this model does not use ChatML, does it? Off the top of my head I don't remember whether llama.cpp's conversion scripts inspect the token ID fields in these files, but I could imagine other tooling does. Should eos_token_id be set to the <|endofresponse|> token (151649) as well? Code links: https://huggingface.co/rednote-hilab/dots.llm1.inst/blob/8a162c1ff5a8a22fa21fdc1bf222ce7cc08e2539/config.json#L8 and https://huggingface.co/rednote-hilab/dots.llm1.inst/blob/8a162c1ff5a8a22fa21fdc1bf222ce7cc08e2539/generation_config.json#L7
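To make that concrete, a small sketch that prints what eos_token_id in those two files resolves to according to the tokenizer (same AutoTokenizer assumption as above, and assuming eos_token_id is a single integer as in the linked files):

```python
# Sketch: show which token string the configured eos_token_id maps to.
import json
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo = "rednote-hilab/dots.llm1.inst"
tok = AutoTokenizer.from_pretrained(repo)

for fname in ("config.json", "generation_config.json"):
    with open(hf_hub_download(repo, fname), encoding="utf-8") as f:
        eos_id = json.load(f)["eos_token_id"]
    print(f"{fname}: eos_token_id={eos_id} -> {tok.convert_ids_to_tokens(eos_id)!r}")
```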