Description
My use case is that I have a corpus of instructions; for example, a small snapshot:
push rbp <eoi>
push rbx <eoi>
mov rbx , rdi <eoi>
sub rsp , 0x10 <eoi>
mov rdi , qword PTR <STACK_ADDR> <eoi>
Each line is a separate "instruction", and I want to allow merges across the whole instruction (including whitespace). My first attempt is:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.normalizers import Replace, Sequence as NormalizerSequence
from tokenizers.pre_tokenizers import Split, Sequence as PreTokenizerSequence
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = NormalizerSequence([Replace(",", "")])
tokenizer.pre_tokenizer = PreTokenizerSequence([Split("\n", behavior="removed")])
trainer = BpeTrainer(vocab_size=10000, min_frequency=2, special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
tokenizer.train(files=["experimenting/tokenizer_tmp/full_corpus_merged_split_0.txt"], trainer=trainer)
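I then save the trained tokenizer to the file I try to load below, roughly like this:
# write the model, normalizer, pre-tokenizer, vocab and merges out as a single json file
tokenizer.save("tokenizer.json")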
This basically seems to work, although there are a few issues. One is that the merge list in the saved json file is hard to interpret - which I suspect is related to the main issue: I can't load the tokenizer after training it.
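For reference, this is roughly how I've been looking at the merge list (assuming the usual tokenizer.json layout, with the learned merges stored under model.merges):
import json

with open("tokenizer.json") as f:
    tok_json = json.load(f)

# print a few of the learned merges to check whether any of them contain spaces
print(tok_json["model"]["merges"][:10])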
When I run:
from transformers import PreTrainedTokenizerFast
loaded_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
I get an error:
Exception: data did not match any variant of untagged enum ModelWrapper at line 11182 column 3
It seems to be related to #566,
which led me to #909,
which hasn't been merged and suggests that merges containing spaces probably aren't supported without that PR. Strangely, though, things largely work right up until I try to load the tokenizer.
Does anyone have any suggestions for getting around this, or an alternative approach to do the same thing?