Extend tokenizer vocabulary with new words #627

Closed as not planned
@anferico

Description

Suppose I have a pre-trained tokenizer, e.g. a BertWordPieceTokenizer, with its own vocabulary. My goal is to use it to tokenize some technical text, which will likely contain out-of-vocabulary words (emitted as "[UNK]" tokens).

Is there a way to fine-tune the tokenizer so that unknown words are automatically added to its vocabulary? I have found similar issues in the transformers repository (transformers/issues/2691 and transformers/issues/1413), but what they suggest is to add unknown tokens manually, whereas I would like them to be added automatically.
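For context, the manual approach those issues suggest looks roughly like this, sketched here with huggingface/tokenizers (the tiny vocabulary and the word 'spectrogram' are made-up examples, not my real data):

```python
from tokenizers import Tokenizer, models, pre_tokenizers

# A made-up three-word vocabulary standing in for a real pre-trained one.
vocab = {"[UNK]": 0, "some": 1, "text": 2}
tokenizer = Tokenizer(models.WordPiece(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# 'spectrogram' is out-of-vocabulary, so it tokenizes to [UNK] here.
print(tokenizer.encode("some spectrogram text").tokens)

# The manual fix: explicitly register the missing word.
tokenizer.add_tokens(["spectrogram"])

# Now 'spectrogram' survives as its own token.
print(tokenizer.encode("some spectrogram text").tokens)
```

This works, but it requires me to know the missing words in advance, which is exactly what I want to avoid.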

Here's a pseudo-code representation of what I would need:

```python
pre_trained_tokenizer = ...
vocab = pre_trained_tokenizer.get_vocab()

technical_text = [
    'some text with unknown words',
    'some other text with unknown words',
    ...
]

updated_tokenizer = pre_trained_tokenizer.train(
    technical_text,
    initial_vocabulary=vocab
)

# 'new_vocab' contains all words in 'vocab' plus some new words
new_vocab = updated_tokenizer.get_vocab()
```

Can I do that with huggingface/tokenizers and/or huggingface/transformers?
I assumed this would be straightforward, but I haven't been able to find anything useful.
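The closest I can get with the API as I understand it (this is my own workaround sketch, not a confirmed feature) is to train a separate WordPiece tokenizer on the technical corpus and diff the vocabularies myself; the corpus and the stand-in pre-trained vocabulary below are hypothetical:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical domain corpus standing in for my real technical text.
technical_text = [
    "some text with unknown words",
    "some other text with unknown words",
]

# Train a fresh WordPiece tokenizer on the domain corpus.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(technical_text, trainer=trainer)

# Diff against the pre-trained vocabulary to find the genuinely new tokens.
domain_vocab = set(tokenizer.get_vocab())
pretrained_vocab = {"some", "text", "with"}  # stand-in for a real get_vocab()
new_tokens = sorted(domain_vocab - pretrained_vocab)
```

With a transformers tokenizer, the resulting `new_tokens` could then be registered via `tokenizer.add_tokens(new_tokens)` (followed by `model.resize_token_embeddings(len(tokenizer))` if a model is attached), but that still feels like a workaround rather than the `initial_vocabulary`-style continuation I described above.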
