Issue merging across whitespaces #1475

Closed as not planned

Description

@henrycharlesworth

My use case is that I have a corpus of instructions, e.g. this small snapshot:

push rbp <eoi>
push rbx <eoi>
mov rbx , rdi <eoi>
sub rsp , 0x10 <eoi>
mov rdi , qword PTR <STACK_ADDR> <eoi>

Each line is a separate "instruction", and I want to allow merges across the whole instruction (including whitespaces). My first attempt at this is:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Replace, Sequence as NormalizerSequence
from tokenizers.pre_tokenizers import Split, Sequence as PreTokenizerSequence
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = NormalizerSequence([Replace(",", "")])
# Only split on newlines, so merges can cross the spaces within an instruction
tokenizer.pre_tokenizer = PreTokenizerSequence([Split("\n", behavior="removed")])

trainer = BpeTrainer(vocab_size=10000, min_frequency=2, special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
tokenizer.train(files=["experimenting/tokenizer_tmp/full_corpus_merged_split_0.txt"], trainer=trainer)
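
I then save the trained tokenizer (the encode line here is just an illustrative sanity check, using a line from the snapshot above):

output = tokenizer.encode("mov rbx , rdi <eoi>")  # sample instruction from the corpus
print(output.tokens)                              # merged tokens can now span spaces
tokenizer.save("tokenizer.json")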

This basically seems to work, although there are a few issues. One is that the merge list in the saved JSON file is hard to interpret, which I suspect is related to the main issue: I can't load the tokenizer after training it.

When I run:

from transformers import PreTrainedTokenizerFast

loaded_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

I get an error:
Exception: data did not match any variant of untagged enum ModelWrapper at line 11182 column 3
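
My guess as to why (an assumption on my part, not something confirmed in this thread): the tokenizer.json schema stores each BPE merge as a single space-delimited string, so a merge whose halves themselves contain spaces can no longer be split apart unambiguously. Roughly:

# Hypothetical "merges" entries, for illustration only:
ok = "pu sh"        # parses cleanly as merge("pu", "sh")
bad = "push  rbp"   # merge("push ", " rbp")? merge("push", "  rbp")? ambiguous
left, right = bad.split(" ", 1)  # a naive parse must pick one split arbitrarily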

This seems to be related to #566, which led me to #909. That PR hasn't been merged, and it suggests that merges containing spaces aren't supported yet (without that PR). Strangely, though, everything largely works up until the point where I try to load the tokenizer.

Does anyone have any suggestions for getting around this, or an alternative approach that achieves the same thing?
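
In the meantime, the best idea I have is a placeholder trick: during normalization, replace spaces with a character that never occurs in the corpus, so no merge ever contains a literal space, and map it back when decoding. A sketch (the "▁" placeholder and the decoders.Replace step are my own assumptions, untested):

from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
from tokenizers.normalizers import Replace, Sequence as NormalizerSequence
from tokenizers.pre_tokenizers import Split
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
# Map spaces to "▁" so that no BPE merge ever contains a literal space
tokenizer.normalizer = NormalizerSequence([Replace(",", ""), Replace(" ", "▁")])
tokenizer.pre_tokenizer = Split("\n", behavior="removed")
tokenizer.decoder = decoders.Replace("▁", " ")  # restore spaces when decoding

trainer = BpeTrainer(vocab_size=10000, min_frequency=2,
                     special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
tokenizer.train(files=["experimenting/tokenizer_tmp/full_corpus_merged_split_0.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")  # merges contain no spaces, so loading should work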
