Description
My use case is that I have a corpus of instructions; for example, a small snapshot:
push rbp <eoi>
push rbx <eoi>
mov rbx , rdi <eoi>
sub rsp , 0x10 <eoi>
mov rdi , qword PTR <STACK_ADDR> <eoi>
Each line is a separate "instruction", and I want to allow merges across the whole instruction (including whitespace). My first attempt is:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.normalizers import Replace, Sequence as NormalizerSequence
from tokenizers.pre_tokenizers import Split, Sequence as PreTokenizerSequence
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = NormalizerSequence([Replace(",", "")])
tokenizer.pre_tokenizer = PreTokenizerSequence([Split("\n", behavior="removed")])
trainer = BpeTrainer(vocab_size=10000, min_frequency=2, special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
tokenizer.train(files=["experimenting/tokenizer_tmp/full_corpus_merged_split_0.txt"], trainer=trainer)
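I then save the trained tokenizer to the file I try to load below, roughly like this:
# write the model, normalizer, pre-tokenizer, vocab and merges out as a single json file
tokenizer.save("tokenizer.json")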
This basically seems to work, although there are a few issues. One is that the merge list in the saved json file is hard to interpret - which I suspect is related to the main issue: I can't load the tokenizer after training it.
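For reference, this is roughly how I've been looking at the merge list (assuming the usual tokenizer.json layout, with the learned merges stored under model.merges):
import json

with open("tokenizer.json") as f:
    tok_json = json.load(f)

# print a few of the learned merges to check whether any of them contain spaces
print(tok_json["model"]["merges"][:10])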
When I run:
from transformers import PreTrainedTokenizerFast
loaded_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
I get an error:
Exception: data did not match any variant of untagged enum ModelWrapper at line 11182 column 3
It seems to be related to #566,
which led me to #909,
which hasn't been merged and suggests that merges containing spaces probably aren't supported without that PR. Strangely, though, things largely work right up until I try to load the tokenizer.
Does anyone have any suggestions for getting around this, or an alternative approach to do the same thing?