This is a C++ implementation of WordPiece (BERT) tokenizer inference.
It expects a `.json` file in the HuggingFace format that contains all the information required to set up the tokenizer. You can usually download this file from the HuggingFace model hub. Set the path to your `.json` file when creating the tokenizer:
```cpp
WordPieceTokenizer tokenizer("tokenizer.json");
```
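For instance, a minimal end-to-end sketch might look like the one below. Note that the header name and the `encode` method (and its return type) are assumptions made for illustration; check the repository's headers for the exact API.

```cpp
#include <iostream>
#include <vector>

#include "tokenizer.h"  // hypothetical header name; adjust to this repository's layout

int main() {
    // Load the vocabulary and preprocessing rules from the HuggingFace JSON file.
    WordPieceTokenizer tokenizer("tokenizer.json");

    // Hypothetical call: assumes an encode() method that returns token IDs.
    std::vector<int> ids = tokenizer.encode("Hello, world!");

    // Print the resulting token IDs, separated by spaces.
    for (int id : ids) {
        std::cout << id << ' ';
    }
    std::cout << '\n';
    return 0;
}
```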
This implementation requires the International Components for Unicode (ICU) library to handle Unicode. Install it with:
```sh
sudo apt-get install libicu-dev
```
Compile the tokenizer:
```sh
g++ tokenizer.cpp -licuuc -o tokenizer
```
Create a file `sample_file.txt` that contains the text you want to tokenize, then run the following command to print the indices of the generated tokens to stdout:

```sh
./tokenizer sample_file.txt
```
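For example, assuming the binary was built as shown above, a quick end-to-end run could look like this (the printed indices depend entirely on the vocabulary in your `tokenizer.json`):

```sh
echo "Hello, world!" > sample_file.txt
./tokenizer sample_file.txt
```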
If you would like to automatically compare this tool's token IDs to those from the native Python HuggingFace implementation, you can do the following:

- (Optionally) Add your test `.txt` files to the `tests/input_texts` folder.
- Run the following command to check whether the C++ implementation matches the HuggingFace implementation:

```sh
python run_tests.py
```
If you find an example for which this tool fails, feel free to open an issue, and I'll look into it.