Description
I want to expand my vocabulary from 32000 to 32002 by adding two special tokens: '<start>' and '<end>'. However, after I hacked the SentencePiece BPE model (appended two user-defined pieces), resized the relevant model weights accordingly, and converted the result to a .ggml file, I found that the special tokens were still split into several token IDs at inference time. Adjusting the scores of the special tokens did not help. Because I want to keep the original order of the first 32000 tokens, I have not tried inserting the special tokens at positions between 0 and 32000.
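For reference, this is roughly how I appended the two pieces (a minimal sketch; the file names and the score of 0.0 are just placeholders from my setup, not anything official):

```python
# Minimal sketch: append two USER_DEFINED pieces to a SentencePiece model.
# File names and scores are placeholders.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    m.ParseFromString(f.read())

for tok in ("<start>", "<end>"):
    piece = sp_pb2.ModelProto.SentencePiece()
    piece.piece = tok
    piece.score = 0.0
    # USER_DEFINED pieces are matched as a whole instead of going through BPE merges
    piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED
    m.pieces.append(piece)

with open("tokenizer_extended.model", "wb") as f:
    f.write(m.SerializeToString())
```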
In the original SentencePiece tokenizer, this also does not work when the added piece's type is not user-defined (see https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto). So I would like to ask whether llama_cpp is considering support for user-defined tokens. (Until then, a workaround is to tokenize with the Python sentencepiece package and run inference through llama-cpp-python.)
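As a rough illustration of that workaround (a sketch only; `tokenizer_extended.model` is the hacked model from above):

```python
# Sketch of the workaround: tokenize in Python with the extended SentencePiece
# model, where <start> / <end> come out as single ids instead of being split.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer_extended.model")
ids = sp.encode("<start> hello world <end>", out_type=int)
print(ids)  # <start> and <end> should each map to one id (32000 / 32001 here)
```

The resulting ids can then be fed to llama-cpp-python's evaluation API instead of its built-in tokenizer.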
Thanks in advance for your reply.
(By the way, the Llama tokenizer (BPE) was trained with the add_dummy_prefix option enabled, so do not directly use the add_special_tokens function of the Hugging Face transformers tokenizer in your training. It first splits the whole sentence on the special tokens and then passes the remaining parts to the BPE model, so the BPE model adds a space at the beginning of each part, which makes the token IDs inconsistent with those produced during llama_cpp inference. A small comparison is sketched below.)
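A quick way to see the mismatch (a sketch; it assumes the '<start>'/'<end>' tokens have been added on both sides, and the Hugging Face model path is a placeholder):

```python
# Sketch: compare Hugging Face tokenization (with added special tokens) against
# plain SentencePiece on the extended model. Because of add_dummy_prefix, the
# HF path tends to insert an extra "▁" after each special token, so the
# sequences can differ from what llama_cpp produces at inference time.
import sentencepiece as spm
from transformers import LlamaTokenizer

text = "<start>hello world<end>"

hf_tok = LlamaTokenizer.from_pretrained("path/to/hf-llama")  # placeholder path
hf_tok.add_special_tokens({"additional_special_tokens": ["<start>", "<end>"]})
print(hf_tok.tokenize(text))

sp = spm.SentencePieceProcessor(model_file="tokenizer_extended.model")
print(sp.encode(text, out_type=str))
```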