convert: handle when model's tokenization method relies on Mecab #13830

Closed
huydt84 wants to merge 9 commits

Conversation


@huydt84 huydt84 commented May 27, 2025

This PR adds more support for Japanese-based models (especially BertJapanese) by:

  • Auto-installing fugashi[unidic-lite] when the model's tokenization method relies on Mecab (a minimal sketch of such a check follows this list)
  • Only printing the "pre_tokenizer" content from tokenizer.json if the file exists
  • Letting the download_model function continue with the remaining files when one file doesn't exist (many BertJapanese models don't have tokenizer.json, which can disrupt the download process)
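A minimal sketch of what such an auto-install check could look like; the function name and config layout here are assumptions for illustration, not the PR's exact diff:

```python
# Hypothetical sketch, not the PR's actual code.
import importlib.util
import json
import subprocess
import sys

def ensure_mecab_deps(tokenizer_config_path: str) -> None:
    # Read the tokenizer config and check whether it declares a Mecab word tokenizer.
    with open(tokenizer_config_path, encoding="utf-8") as f:
        cfg = json.load(f)
    if cfg.get("word_tokenizer_type") != "mecab":
        return
    # Install fugashi with the unidic-lite dictionary only if it is missing.
    if importlib.util.find_spec("fugashi") is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "fugashi[unidic-lite]"])
```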

@github-actions github-actions bot added the python python script changes label May 27, 2025
@huydt84 huydt84 requested a review from ngxson May 27, 2025 22:58

@ngxson ngxson left a comment

Keep in mind that other models also use the same script, try not to introduce destructive changes that may affect other models.

Comment on lines 153 to 161
```python
except requests.HTTPError as e:
    if e.response.status_code == 404:
        logger.warning(f"URL not found: {url}")
    else:
        logger.error(f"HTTP error occurred when downloading {url}: {e}")
except requests.ConnectionError:
    logger.error(f"Connection error occurred when downloading {url}")
except Exception as e:
    logger.error(f"Unexpected error occurred when downloading {url}: {e}")
```
Collaborator

  1. This whole multiple-except block can be just one single except Exception as e. No need to over-engineer the error handling if you're only interested in logging it.
  2. The old code doesn't have this handling, so it will simply terminate the script if there's an error. Now with this, errors will be ignored. I think this is not the expected behavior.
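A minimal sketch of the single-handler version being suggested, assuming the same logger and url variables as the snippet above; the download call itself is a placeholder:

```python
import requests

try:
    response = requests.get(url)  # placeholder for the actual download call
    response.raise_for_status()
except Exception as e:
    # A single catch-all suffices when the only goal is to log the failure.
    logger.error(f"Error occurred when downloading {url}: {e}")
```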

Contributor Author

Actually, I don't have access to many models in the list, so the script terminates every time I run it (unless I comment out the other models). The instructions at the beginning of the file say Add a new model to the "models" list, which may confuse users.

What is your suggestion about this?

Collaborator

I usually just temporarily comment out all the other models and then run the script. But yes, having the ability to update only the newly added model would be a better approach. I will add it in another PR.

Collaborator

For now, let's simply remove this change from this PR

Contributor Author

I removed this change.

> I will add it in another PR

Thank you in advance!

if "ignore_merges" in cfg["model"]:
logger.info("ignore_merges: " + json.dumps(cfg["model"]["ignore_merges"], indent=4))
# print the "pre_tokenizer" content from the tokenizer.json, if exists
if os.path.isfile(f"models/tokenizers/{name}/tokenizer.json"):
Collaborator

This will alter the behavior of other models

Instead, check for cfg["word_tokenizer_type"] == "mecab" and only skip this for that particular model.
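A hedged sketch of that targeted check, assuming cfg is the parsed tokenizer_config.json and that json, os, logger, and name are in scope as in the script; the exact placement is illustrative:

```python
if cfg.get("word_tokenizer_type") == "mecab":
    # Mecab-based models (e.g. BertJapanese) ship no tokenizer.json; skip the dump.
    logger.info("Mecab word tokenizer detected, skipping pre_tokenizer dump")
else:
    with open(f"models/tokenizers/{name}/tokenizer.json", encoding="utf-8") as f:
        pre_cfg = json.load(f)
    logger.info("pre_tokenizer: " + json.dumps(pre_cfg.get("pre_tokenizer"), indent=4))
```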

Contributor Author

I'm sorry. I just fixed that.

Comment on lines 25 to 26
```python
import subprocess
import importlib.util
```
Collaborator

this can be removed

Contributor Author

I removed it

```diff
@@ -117,17 +118,47 @@ class TOKENIZER_TYPE(IntEnum):
     {"name": "glm4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-hf", },
     {"name": "pixtral", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistral-community/pixtral-12b", },
     {"name": "seed-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base", },
+    {"name": "ruri-large", "tokt": TOKENIZER_TYPE.WPM, "repo": "https://huggingface.co/cl-nagoya/ruri-large", },
```
Collaborator

If you add it here, you must also run the script so it updates convert_hf_to_gguf and include the change in this PR

Collaborator

And btw, do we even have the CPP code to handle this? Is this already tested?

Contributor Author

I tested that model and similar models (ruri-*) locally for embedding task and it worked.

> If you add it here, you must also run the script so it updates convert_hf_to_gguf and include the change in this PR

I'm sorry. About this, like I said before, I don't have access to many models in the list, so it's hard to run all the listed models to update convert_hf_to_gguf. Can you do that for me? If not, how do you think we can handle this (like leaving a comment noting that some Japanese models require vocab.txt)?

Collaborator

When #13847 is merged, you can run the script again and this time it will only process the newly added model


huydt84 commented Jun 2, 2025

@ngxson Please check again.

@huydt84 huydt84 requested a review from ngxson June 2, 2025 12:54

CISC commented Jun 3, 2025

Hmmm, wait, is this PR just to tell the user how to install fugashi for BertJapaneseTokenizer? If so, this won't work: users won't be running convert_hf_to_gguf_update.py, and even if they did, that code won't run now.


huydt84 commented Jun 3, 2025

@CISC If BertJapanese models (or other Japanese-based models) use the Mecab tokenizer, the fugashi library and the unidic dictionary (or corresponding alternatives) must be installed for AutoTokenizer to work normally.
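For illustration, loading such a model fails at tokenizer construction time if those packages are missing; the model name here is just an example taken from this PR's added entry:

```python
# Requires: pip install fugashi[unidic-lite]  (otherwise this raises at load time)
from transformers import AutoTokenizer

# BertJapanese configs declare word_tokenizer_type "mecab", so the tokenizer
# tries to import fugashi and errors out with install instructions if absent.
tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/ruri-large")
print(tokenizer.tokenize("日本語のテスト"))
```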


huydt84 commented Jun 3, 2025

The PR name is a bit confusing

@huydt84 huydt84 changed the title convert: add support for Japanese Bert model convert: handle when model's tokenization method relies on Mecab Jun 3, 2025

CISC commented Jun 3, 2025

Right, but this PR fails to do that ATM, and I'm not sure you can cleanly add it to convert_hf_to_gguf.py where it belongs either. Is there some check you can do in the model class, or does conversion crash before that?


huydt84 commented Jun 3, 2025

The conversion crashes when AutoTokenizer is loaded. This problem does not depend on the model class, but on the tokenizer type in config.json.


CISC commented Jun 3, 2025

Yeah, I was afraid of that, here right?

```python
def get_vocab_base(self) -> tuple[list[str], list[int], str]:
    tokens: list[str] = []
    toktypes: list[int] = []

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
```

So, unfortunately I think you have to scrap everything; the current changes are incorrect and don't solve the problem.

I took a look though, and why do we need to emit our own notice? Doesn't AutoTokenizer already do that?
https://github.com/huggingface/transformers/blob/55ec319de6a90c7b8db1218a5d73837fc6098371/src/transformers/models/bert_japanese/tokenization_bert_japanese.py#L407-L413
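For reference, the linked transformers code guards the fugashi import roughly like this; paraphrased from the link above, not copied verbatim:

```python
try:
    import fugashi
except ModuleNotFoundError as error:
    # transformers re-raises with installation instructions, so a separate
    # notice in the convert script would be redundant.
    raise error.__class__(
        "You need to install fugashi to use MecabTokenizer. "
        "See https://pypi.org/project/fugashi/ for installation."
    ) from error
```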


huydt84 commented Jun 3, 2025

> Doesn't AutoTokenizer already do that?

Wow, I actually didn't know about that :) So the changes in this PR are meaningless (there are also some trivial fixes, but the main purpose is not achieved).

@huydt84 huydt84 closed this Jun 3, 2025