
BPE Tokenizer #286


Open
wannaphong opened this issue Jan 28, 2025 · 2 comments

@wannaphong

Is it possible to use a BPE tokenizer instead of rwkv_vocab_v20230424 in the next model?

I tried an RWKV model on Thai. The output looks good, but inference is very slow because rwkv_vocab_v20230424 tokenizes Thai at the character level.

I think if the next model used a BPE tokenizer like Qwen2's, it could improve both the model quality and the speed.
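For reference, here is a minimal sketch of the sequence-length gap being described. It assumes the `transformers` package and the public `Qwen/Qwen2-7B` checkpoint are available, and it only approximates rwkv_vocab_v20230424's behaviour on Thai by counting characters/UTF-8 bytes:

```python
# Rough comparison of sequence lengths for a Thai sentence.
from transformers import AutoTokenizer

text = "สวัสดีครับ ยินดีที่ได้รู้จัก"  # "Hello, nice to meet you"

# Approximation of rwkv_vocab_v20230424 on Thai: with few Thai merges,
# the text ends up as roughly one token per character / UTF-8 byte.
char_level_len = len(text)
byte_level_len = len(text.encode("utf-8"))

# A BPE tokenizer trained on multilingual data (Qwen2 here, as an example)
# merges frequent Thai substrings into single tokens.
bpe = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
bpe_len = len(bpe.encode(text))

print(f"chars: {char_level_len}, utf-8 bytes: {byte_level_len}, BPE tokens: {bpe_len}")
```

A shorter token sequence for the same text means fewer forward passes per sentence, which is where the speed difference comes from.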

@BlinkDL
Owner

BlinkDL commented Feb 16, 2025

Or we can simply improve the Thai tokens in rwkv_vocab_v20230424 :)

Please try to improve it and let me know.
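One rough way to propose additional Thai entries for the vocab is to count frequent character n-grams in a Thai corpus and inspect the top candidates. This is only a sketch, not the project's actual vocab-building procedure, and `thai_corpus.txt` is a hypothetical file name:

```python
# Count frequent Thai character n-grams as candidate vocab entries.
from collections import Counter

def frequent_ngrams(lines, n_min=2, n_max=6, top_k=2000):
    counts = Counter()
    for line in lines:
        for n in range(n_min, n_max + 1):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

with open("thai_corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    candidates = frequent_ngrams(f.read().splitlines())

print(candidates[:20])  # inspect before adding anything to the vocab
```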

@ofou

ofou commented Feb 17, 2025

Is there any way to use a custom tokenizer, by the way (e.g. raw UTF-8 ids)?

It'd be cool to get rid of the tokenizer itself, obviously at the expense of some compute efficiency, in exchange for more interpretability while covering the whole of Unicode.
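For concreteness, "UTF-8 ids" as a tokenizer can be as simple as treating each byte value 0–255 as a token, which covers all of Unicode by construction. A minimal sketch:

```python
# Byte-level "tokenizer": one token per UTF-8 byte, vocabulary size 256.
def encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8", errors="replace")

ids = encode("สวัสดี 👋")
print(ids)           # one id per byte, values in 0..255
print(decode(ids))   # round-trips back to the original string
```

The trade-off mentioned above is visible here: Thai and emoji expand to 3–4 tokens per character, so sequences get longer even though nothing is ever out-of-vocabulary.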
