Description
Suppose I have a pre-trained tokenizer, e.g. a BertWordPieceTokenizer, with its own vocabulary. My goal is to use it to tokenize some technical text, which will likely contain unknown words (represented as "[UNK]" tokens).
Is there a way to fine-tune the tokenizer so that unknown words are automatically added to its vocabulary? I have found similar issues in the transformers repository (transformers/issues/2691 and transformers/issues/1413), but what they suggest is to manually add unknown tokens, whereas I would like them to be added automatically.
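For reference, the manual approach suggested in those issues looks roughly like this (a minimal sketch using the transformers API; the domain words are made-up placeholders):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Manually extend the vocabulary with known domain-specific terms
# (placeholder tokens, just for illustration)
num_added = tokenizer.add_tokens(['myocarditis', 'electrocardiogram'])
print(num_added, len(tokenizer))  # tokens added, new vocabulary size

# If the tokenizer is paired with a model, the embedding matrix
# must be resized afterwards:
# model.resize_token_embeddings(len(tokenizer))

This works, but it requires knowing the unknown words in advance, which is exactly what I want to avoid.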
Here's a pseudo-code representation of what I would need:
pre_trained_tokenizer = ...
vocab = pre_trained_tokenizer.get_vocab()
technical_text = [
    'some text with unknown words',
    'some other text with unknown words',
    ...
]
updated_tokenizer = pre_trained_tokenizer.train(
    technical_text,
    initial_vocabulary=vocab
)
new_vocab = updated_tokenizer.get_vocab()  # 'new_vocab' contains all words in 'vocab' plus some new words
Can I do that with huggingface/tokenizers and/or huggingface/transformers?
I expected this to be easy, but I wasn't able to find anything useful.
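The closest workaround I can think of is to automate the manual approach: scan the corpus for whole words whose encoding contains the [UNK] id, then add those with add_tokens. A rough sketch (note that whitespace splitting is a simplification of BERT's actual pre-tokenization, which also splits on punctuation):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

technical_text = [
    'some text with unknown words',
    'some other text with unknown words',
]

# Collect whole words that the tokenizer cannot represent
unknown_words = set()
for line in technical_text:
    for word in line.split():
        ids = tokenizer.encode(word, add_special_tokens=False)
        if tokenizer.unk_token_id in ids:
            unknown_words.add(word)

tokenizer.add_tokens(sorted(unknown_words))

But this only adds whole words; it doesn't learn new subword units from the corpus the way actually continuing the WordPiece training would, so it's not really what I'm after.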