
BPE Tokenizer #286


Open
wannaphong opened this issue Jan 28, 2025 · 2 comments

@wannaphong

Is it possible to use a BPE tokenizer instead of rwkv_vocab_v20230424 in the next model?

I tried an RWKV model on Thai. The output looks good, but inference is very slow because rwkv_vocab_v20230424 tokenizes Thai at the character level.

I think if the next model used a BPE tokenizer like Qwen2's, it could improve both the model quality and the speed.
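For reference, here is a minimal sketch of the sequence-length gap being described. It assumes the `transformers` package and the public `Qwen/Qwen2-7B` checkpoint are available, and it only approximates rwkv_vocab_v20230424's behaviour on Thai by counting characters/UTF-8 bytes:

```python
# Rough comparison of sequence lengths for a Thai sentence.
from transformers import AutoTokenizer

text = "สวัสดีครับ ยินดีที่ได้รู้จัก"  # "Hello, nice to meet you"

# Approximation of rwkv_vocab_v20230424 on Thai: with few Thai merges,
# the text ends up as roughly one token per character / UTF-8 byte.
char_level_len = len(text)
byte_level_len = len(text.encode("utf-8"))

# A BPE tokenizer trained on multilingual data (Qwen2 here, as an example)
# merges frequent Thai substrings into single tokens.
bpe = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
bpe_len = len(bpe.encode(text))

print(f"chars: {char_level_len}, utf-8 bytes: {byte_level_len}, BPE tokens: {bpe_len}")
```

A shorter token sequence for the same text means fewer forward passes per sentence, which is where the speed difference comes from.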

@BlinkDL
Owner

BlinkDL commented Feb 16, 2025

Or we can simply improve the Thai tokens in rwkv_vocab_v20230424 :)

Please try to improve it and let me know.
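One rough way to propose additional Thai entries for the vocab is to count frequent character n-grams in a Thai corpus and inspect the top candidates. This is only a sketch, not the project's actual vocab-building procedure, and `thai_corpus.txt` is a hypothetical file name:

```python
# Count frequent Thai character n-grams as candidate vocab entries.
from collections import Counter

def frequent_ngrams(lines, n_min=2, n_max=6, top_k=2000):
    counts = Counter()
    for line in lines:
        for n in range(n_min, n_max + 1):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

with open("thai_corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    candidates = frequent_ngrams(f.read().splitlines())

print(candidates[:20])  # inspect before adding anything to the vocab
```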

@ofou

ofou commented Feb 17, 2025

Is there any way to use a custom tokenizer, by the way (e.g. raw UTF-8 ids)?

It'd be cool to get rid of the tokenizer itself, obviously at the expense of some compute efficiency, in exchange for more interpretability while covering the whole of Unicode.
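For concreteness, "UTF-8 ids" as a tokenizer can be as simple as treating each byte value 0–255 as a token, which covers all of Unicode by construction. A minimal sketch:

```python
# Byte-level "tokenizer": one token per UTF-8 byte, vocabulary size 256.
def encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8", errors="replace")

ids = encode("สวัสดี 👋")
print(ids)           # one id per byte, values in 0..255
print(decode(ids))   # round-trips back to the original string
```

The trade-off mentioned above is visible here: Thai and emoji expand to 3–4 tokens per character, so sequences get longer even though nothing is ever out-of-vocabulary.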
