convert: handle when model's tokenization method relies on Mecab #13830

Closed
huydt84 wants to merge 9 commits

Conversation


@huydt84 huydt84 commented May 27, 2025

This PR adds more support for Japanese-based models (especially BertJapanese) by:

  • Auto-installing fugashi[unidic-lite] when the model's tokenization method relies on Mecab (a minimal sketch of such a check follows this list)
  • Only printing the "pre_tokenizer" content from tokenizer.json if the file exists
  • Letting the download_model function continue with the remaining files when one file doesn't exist (many BertJapanese models don't have tokenizer.json, which can disrupt the download process)
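A minimal sketch of what such an auto-install check could look like; the function name and config layout here are assumptions for illustration, not the PR's exact diff:

```python
# Hypothetical sketch, not the PR's actual code.
import importlib.util
import json
import subprocess
import sys

def ensure_mecab_deps(tokenizer_config_path: str) -> None:
    # Read the tokenizer config and check whether it declares a Mecab word tokenizer.
    with open(tokenizer_config_path, encoding="utf-8") as f:
        cfg = json.load(f)
    if cfg.get("word_tokenizer_type") != "mecab":
        return
    # Install fugashi with the unidic-lite dictionary only if it is missing.
    if importlib.util.find_spec("fugashi") is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "fugashi[unidic-lite]"])
```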

@github-actions github-actions bot added the python python script changes label May 27, 2025
@huydt84 huydt84 requested a review from ngxson May 27, 2025 22:58

@ngxson ngxson left a comment

Keep in mind that other models also use the same script, try not to introduce destructive changes that may affect other models.

Comment on lines 153 to 161
```python
except requests.HTTPError as e:
    if e.response.status_code == 404:
        logger.warning(f"URL not found: {url}")
    else:
        logger.error(f"HTTP error occurred when downloading {url}: {e}")
except requests.ConnectionError:
    logger.error(f"Connection error occurred when downloading {url}")
except Exception as e:
    logger.error(f"Unexpected error occurred when downloading {url}: {e}")
```
Collaborator

  1. This whole multiple-except block can be just one single except Exception as e. No need to over-engineer the error handling if you're only interested in logging it.
  2. The old code doesn't have this handling, so it will simply terminate the script if there's an error. Now with this, errors will be ignored. I think this is not the expected behavior.
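A minimal sketch of the single-handler version being suggested, assuming the same logger and url variables as the snippet above; the download call itself is a placeholder:

```python
import requests

try:
    response = requests.get(url)  # placeholder for the actual download call
    response.raise_for_status()
except Exception as e:
    # A single catch-all suffices when the only goal is to log the failure.
    logger.error(f"Error occurred when downloading {url}: {e}")
```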

Contributor Author

Actually, I don't have access to many models in the list, so the script terminates every time I run it (unless I comment out the other models). The instructions at the beginning of the file say Add a new model to the "models" list, which may confuse users.

What is your suggestion about this?

Collaborator

I usually just temporarily comment out all the other models and then run the script. But yes, having the ability to update only the newly added model would be a better approach. I will add it in another PR.

Collaborator

For now, let's simply remove this change from this PR

Contributor Author

I removed this change.

> I will add it in another PR

Thank you in advance!

if "ignore_merges" in cfg["model"]:
logger.info("ignore_merges: " + json.dumps(cfg["model"]["ignore_merges"], indent=4))
# print the "pre_tokenizer" content from the tokenizer.json, if exists
if os.path.isfile(f"models/tokenizers/{name}/tokenizer.json"):
Collaborator

This will alter the behavior of other models

Instead, check for cfg["word_tokenizer_type"] == "mecab" and only skip this for that particular model.
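A hedged sketch of that targeted check, assuming cfg is the parsed tokenizer_config.json and that json, os, logger, and name are in scope as in the script; the exact placement is illustrative:

```python
if cfg.get("word_tokenizer_type") == "mecab":
    # Mecab-based models (e.g. BertJapanese) ship no tokenizer.json; skip the dump.
    logger.info("Mecab word tokenizer detected, skipping pre_tokenizer dump")
else:
    with open(f"models/tokenizers/{name}/tokenizer.json", encoding="utf-8") as f:
        pre_cfg = json.load(f)
    logger.info("pre_tokenizer: " + json.dumps(pre_cfg.get("pre_tokenizer"), indent=4))
```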

Contributor Author

I'm sorry. I just fixed that.

Comment on lines 25 to 26
```python
import subprocess
import importlib.util
```
Collaborator

this can be removed

Contributor Author

I removed it

```diff
@@ -117,17 +118,47 @@ class TOKENIZER_TYPE(IntEnum):
     {"name": "glm4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-hf", },
     {"name": "pixtral", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistral-community/pixtral-12b", },
     {"name": "seed-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base", },
+    {"name": "ruri-large", "tokt": TOKENIZER_TYPE.WPM, "repo": "https://huggingface.co/cl-nagoya/ruri-large", },
```
Collaborator

If you add it here, you must also run the script so it updates convert_hf_to_gguf and include the change in this PR

Collaborator

And btw, do we even have the CPP code to handle this? Is this already tested?

Contributor Author

I tested that model and similar models (ruri-*) locally for embedding task and it worked.

> If you add it here, you must also run the script so it updates convert_hf_to_gguf and include the change in this PR

I'm sorry. About this, like I said before, I don't have access to many models in the list, so it's hard to run all the listed models to update convert_hf_to_gguf. Can you do that for me? If not, how do you think we can handle this (like leaving a comment noting that some Japanese models require vocab.txt)?

Collaborator

When #13847 is merged, you can run the script again and this time it will only process the newly added model


huydt84 commented Jun 2, 2025

@ngxson Please check again.

@huydt84 huydt84 requested a review from ngxson June 2, 2025 12:54

CISC commented Jun 3, 2025

Hmmm, wait, is this PR just to tell the user how to install fugashi for BertJapaneseTokenizer? If so, this won't work: users won't be running convert_hf_to_gguf_update.py, and even if they did, that code won't run now.


huydt84 commented Jun 3, 2025

@CISC If BertJapanese models (or other Japanese-based models) use the Mecab tokenizer, the fugashi library and the unidic dictionary (or corresponding alternatives) must be installed for AutoTokenizer to work normally.
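For illustration, loading such a model fails at tokenizer construction time if those packages are missing; the model name here is just an example taken from this PR's added entry:

```python
# Requires: pip install fugashi[unidic-lite]  (otherwise this raises at load time)
from transformers import AutoTokenizer

# BertJapanese configs declare word_tokenizer_type "mecab", so the tokenizer
# tries to import fugashi and errors out with install instructions if absent.
tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/ruri-large")
print(tokenizer.tokenize("日本語のテスト"))
```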


huydt84 commented Jun 3, 2025

The PR name is a bit confusing

@huydt84 huydt84 changed the title convert: add support for Japanese Bert model convert: handle when model's tokenization method relies on Mecab Jun 3, 2025

CISC commented Jun 3, 2025

Right, but this PR fails to do that ATM, and I'm not sure you can cleanly add it to convert_hf_to_gguf.py where it belongs either. Is there some check you can do in the model class, or does conversion crash before that?


huydt84 commented Jun 3, 2025

The conversion crashes when AutoTokenizer is loaded. This problem does not depend on the model class, but on the tokenizer type in config.json.


CISC commented Jun 3, 2025

Yeah, I was afraid of that, here right?

```python
def get_vocab_base(self) -> tuple[list[str], list[int], str]:
    tokens: list[str] = []
    toktypes: list[int] = []

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
```

So, unfortunately I think you have to scrap everything; the current changes are incorrect and don't solve the problem.

I took a look though, and why do we need to emit our own notice? Doesn't AutoTokenizer already do that?
https://github.com/huggingface/transformers/blob/55ec319de6a90c7b8db1218a5d73837fc6098371/src/transformers/models/bert_japanese/tokenization_bert_japanese.py#L407-L413
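For reference, the linked transformers code guards the fugashi import roughly like this; paraphrased from the link above, not copied verbatim:

```python
try:
    import fugashi
except ModuleNotFoundError as error:
    # transformers re-raises with installation instructions, so a separate
    # notice in the convert script would be redundant.
    raise error.__class__(
        "You need to install fugashi to use MecabTokenizer. "
        "See https://pypi.org/project/fugashi/ for installation."
    ) from error
```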


huydt84 commented Jun 3, 2025

> Doesn't AutoTokenizer already do that?

Wow, I actually didn't know about that :) So the changes in this PR are meaningless (there are also some trivial fixes, but the main purpose is not achieved).

@huydt84 huydt84 closed this Jun 3, 2025