Description
Suppose I have a pre-trained tokenizer, e.g. a BertWordPieceTokenizer, with its own vocabulary. My goal is to use it to tokenize some technical text, which will likely contain unknown words (represented as "[UNK]" tokens).
Is there a way to fine-tune the tokenizer so that unknown words are automatically added to its vocabulary? I have found similar issues in the transformers repository (transformers/issues/2691 and transformers/issues/1413), but what they suggest is to manually add unknown tokens, whereas I would like them to be added automatically.
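For reference, the manual approach suggested in those issues looks roughly like this (a minimal sketch using the transformers API; the domain words are made-up placeholders):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Manually extend the vocabulary with known domain-specific terms
# (placeholder tokens, just for illustration)
num_added = tokenizer.add_tokens(['myocarditis', 'electrocardiogram'])
print(num_added, len(tokenizer))  # tokens added, new vocabulary size

# If the tokenizer is paired with a model, the embedding matrix
# must be resized afterwards:
# model.resize_token_embeddings(len(tokenizer))

This works, but it requires knowing the unknown words in advance, which is exactly what I want to avoid.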
Here's a pseudo-code representation of what I would need:
pre_trained_tokenizer = ...
vocab = pre_trained_tokenizer.get_vocab()
technical_text = [
    'some text with unknown words',
    'some other text with unknown words',
    ...
]
updated_tokenizer = pre_trained_tokenizer.train(
    technical_text,
    initial_vocabulary=vocab
)
new_vocab = updated_tokenizer.get_vocab()  # 'new_vocab' contains all words in 'vocab' plus some new words
Can I do that with huggingface/tokenizers and/or huggingface/transformers?
I expected this to be easy, but I wasn't able to find anything useful.
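The closest workaround I can think of is to automate the manual approach: scan the corpus for whole words whose encoding contains the [UNK] id, then add those with add_tokens. A rough sketch (note that whitespace splitting is a simplification of BERT's actual pre-tokenization, which also splits on punctuation):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

technical_text = [
    'some text with unknown words',
    'some other text with unknown words',
]

# Collect whole words that the tokenizer cannot represent
unknown_words = set()
for line in technical_text:
    for word in line.split():
        ids = tokenizer.encode(word, add_special_tokens=False)
        if tokenizer.unk_token_id in ids:
            unknown_words.add(word)

tokenizer.add_tokens(sorted(unknown_words))

But this only adds whole words; it doesn't learn new subword units from the corpus the way actually continuing the WordPiece training would, so it's not really what I'm after.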