
Nanotron, Multilingual tasks update + misc #756


Merged: 15 commits into main on May 22, 2025

Conversation

hynky1999 (Collaborator)

Nanotron

Metrics

  • Probability metrics now work with char normalization (a sketch follows below)
  • Token normalization is fixed when used with transformers
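
A minimal sketch of what char-normalized probability scoring can look like (illustrative only; the function name and exact formula are assumptions, not the lighteval implementation):

```python
import numpy as np


def char_normalized_probability(logprobs: list[float], reference_texts: list[str]) -> float:
    # Divide each choice's summed log-probability by the character length of its
    # reference text, then return the best-scoring choice as a probability.
    normalized = [lp / len(text) for lp, text in zip(logprobs, reference_texts)]
    return float(np.exp(max(normalized)))
```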

Multilingual tasks

  • New SQuAD-style QA tasks + a few new MCF benchmarks + small fixes to existing tasks

Misc

  • The QA template now only uses unique golds, which speeds up the probability calculation (a sketch follows below)
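
A rough sketch of the deduplication described above (assumed names, not the actual template code):

```python
def deduplicate_choices(choices: list[str], gold_indices: list[int]) -> tuple[list[str], list[int]]:
    # Keep unique choices in their original order and remap the gold indices so
    # they point into the deduplicated list; duplicate golds collapse into one.
    unique_choices = list(dict.fromkeys(choices))
    new_golds = sorted({unique_choices.index(choices[i]) for i in gold_indices})
    return unique_choices, new_golds
```

For example, with choices ["Paris", "Paris", "Rome"] and gold indices [0, 1], this yields (["Paris", "Rome"], [0]), so the gold's probability is computed once instead of twice.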

@hynky1999 hynky1999 changed the title from "Nanotron tf update" to "Nanotron, Multilingual tasks update + misc" on May 20, 2025
@HuggingFaceDocBuilderDev (Collaborator)

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@hynky1999 hynky1999 requested a review from Copilot May 20, 2025 17:39
@Copilot Copilot AI (Contributor) left a comment


Pull Request Overview

This PR improves multilingual task support and refines Nanotron-based model evaluation by updating prompts, configuration, and task adapters while introducing new QA and MCQ datasets and cleaning up deprecated code.

  • Updated QA template to deduplicate choices
  • Improved transformer and Nanotron model handling (e.g. position ids, special tokens, batch sizing)
  • Added new multilingual tasks (GermanQuAD, SQuAD-it, FaQuAD, SQuAD-es, OpenBookQA-es, OAB Exams, ENEM) and adjusted configuration loading

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Summary per file:

  • src/lighteval/tasks/templates/qa.py: Deduplicates QA choices using set(), updating gold index and choices
  • src/lighteval/tasks/templates/multichoice.py: Ensures answers are cast to string for consistent formatting
  • src/lighteval/tasks/multilingual/tasks.py: Introduces new multilingual task configurations with conditional hf_subset logic
  • src/lighteval/tasks/multilingual/adapters.py: Adds an adapter for ENEM tasks
  • src/lighteval/tasks/default_prompts.py: Removes unused utility function
  • src/lighteval/pipeline.py: Adjusts import paths for updated Nanotron model location
  • src/lighteval/models/transformers/transformers_model.py: Refactors token length handling in loglikelihood computation
  • src/lighteval/models/nanotron/nanotron_model.py: Refactors configuration access, changes default special tokens behavior, and adds position_ids support
  • src/lighteval/metrics/metrics_sample.py: Adds an extra parameter to the compute function for improved log-prob normalization
  • src/lighteval/main_nanotron.py: Updates config loading to use YAML SafeLoader and the expanded FullNanotronConfig
  • src/lighteval/config/lighteval_config.py: Incorporates the updated FullNanotronConfig supporting new Nanotron configuration components
Comments suppressed due to low confidence (2)

src/lighteval/metrics/metrics_sample.py:346

  • The new parameter 'reference_texts' should be documented to clarify its usage and expected format in the compute function's docstring.
def compute(self, logprobs: list[float], target_tokens: list[list[int]], reference_texts: list[str], **kwargs) -> float:
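
For illustration, a docstring along these lines would address the comment (a sketch only; the wording and normalization details are assumptions, not the actual lighteval code):

```python
def compute(self, logprobs: list[float], target_tokens: list[list[int]], reference_texts: list[str], **kwargs) -> float:
    """Compute the probability of the gold continuation(s).

    Args:
        logprobs: Summed log-probability of each choice.
        target_tokens: Tokenized choices, used when normalizing by token length.
        reference_texts: Raw choice strings, used when normalizing by character length.

    Returns:
        The (optionally length-normalized) probability as a float.
    """
    ...
```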

src/lighteval/models/nanotron/nanotron_model.py:99

  • Changing the default value of add_special_tokens from True to False could affect tokenization outputs if the model was originally trained with special tokens. Verify that downstream processing and model performance remain correct with this new default.
add_special_tokens: Optional[bool] = False,

@hynky1999 hynky1999 requested a review from Copilot May 20, 2025 17:42
@Copilot Copilot AI (Contributor) left a comment


Pull Request Overview

This PR integrates Nanotron support, enhances metrics normalization and testing, expands multilingual task coverage with new QA/MCQ benchmarks, and applies miscellaneous code improvements.

  • Nanotron: switch to YAML config parsing, adjust model/tokenizer args, remove unused env config, refine batch and tokenization logic
  • Metrics & Tests: accept reference_texts in probability metrics and update unit tests
  • Multilingual Tasks: import enem_adapter, add GermanQuAD, SQuAD-it, FaQuAD, OpenBook QA ES, OAB Exams, ENEM tasks, and handle Japanese subset tags
  • Misc: deduplicate QA choices, cast choices to str, remove unused utilities, refactor prompt templates and dataclasses

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Summary per file:

  • tests/test_unit_base_metrics.py: Include the reference_texts argument in metric tests
  • src/lighteval/tasks/templates/qa.py: Deduplicate MCQ choices using set
  • src/lighteval/tasks/templates/multichoice.py: Cast each choice to str to avoid non-string formatting issues
  • src/lighteval/tasks/multilingual/tasks.py: Import enem_adapter, add several multilingual QA/MCQ task configs, adjust subset tags
  • src/lighteval/tasks/multilingual/adapters.py: Add enem_adapter for the ENEM dataset
  • src/lighteval/tasks/default_prompts.py: Remove the unused get_drop_date utility
  • src/lighteval/pipeline.py: Update the Nanotron import path and parameter renaming
  • src/lighteval/models/transformers/transformers_model.py: Refactor log-likelihood input length handling and pad/gather logic
  • src/lighteval/models/nanotron/nanotron_model.py: Rename config attributes, remove override_bs parameters, assert batch settings
  • src/lighteval/metrics/metrics_sample.py: Extend probability metric compute to accept reference_texts
  • src/lighteval/main_nanotron.py: Change Nanotron config loading to YAML, split into model/tokenizer/general configs
  • src/lighteval/config/lighteval_config.py: Switch to Pydantic for generation args, update FullNanotronConfig fields
Comments suppressed due to low confidence (2)

tests/test_unit_base_metrics.py:196

  • The new reference_texts parameter is only tested with its default None value. Add a test case where reference_texts is non-None to ensure the metric branches that use it are covered.
result = prob_metric.sample_level_fn(logprobs=np.log([0.7]), target_tokens=None, reference_texts=None)
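
A sketch of the kind of follow-up test Copilot is asking for (the metric variable, the availability of np and pytest in the test module, and the char-normalized expected value are all assumptions, not the actual test suite):

```python
# Hypothetical addition: exercise the char-normalization branch with real strings.
result = prob_metric_norm.sample_level_fn(
    logprobs=np.log([0.7]),
    target_tokens=None,
    reference_texts=["hello"],
)
assert result == pytest.approx(0.7 ** (1 / len("hello")))
```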

src/lighteval/tasks/multilingual/tasks.py:3009

  • [nitpick] The Japanese subset tag uses 'jap' here, but elsewhere (e.g., xwinograd) it uses 'jp'. Unify the tag naming for consistency across tasks.
hf_subset=f"X-CSQA-{standardize_tag(language.value) if language != Language.JAPANESE else 'jap'}",

@hynky1999 hynky1999 requested review from anton-l and NathanHB May 20, 2025 17:54
@clefourrier clefourrier (Member) left a comment

A couple of nits (some dead code to rm, some comments to add, and change your asserts), but overall LGTM. Great job!

  • Do we want to pin nanotron to a version above the one we're currently using?
  • If you had the time to add a nanotron model to the test suite, that would be fire, but if not it can be another PR

@@ -343,7 +337,12 @@ def tok_decode(self, tokens: torch.LongTensor) -> List[str]:
return self.tokenizer.batch_decode(tokens, skip_special_tokens=True)

def _model_call(self, inputs: torch.Tensor) -> torch.Tensor:
return self.model(inputs)
position_ids = (
Member:

Add a comment to explain why it's needed (I'm a bit curious about the int32 dtype?)
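
For context, a common way per-token position ids get built for a batch (a generic sketch, not necessarily what this diff does; the int32 dtype here simply mirrors the question above and would presumably match what the Nanotron forward pass expects):

```python
import torch


def build_position_ids(inputs: torch.Tensor) -> torch.Tensor:
    # One position index per token in each row of the batch, dtype int32.
    batch_size, seq_len = inputs.shape
    return (
        torch.arange(seq_len, dtype=torch.int32, device=inputs.device)
        .unsqueeze(0)
        .expand(batch_size, seq_len)
    )
```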


"""Tokenize the context and continuation and compute the log likelihood of those
tokenized sequences.
"""
# requests = requests[:10256]
Member:

rm

Comment on lines +489 to +492
assert (
    full_attention_masks is False
), "full_attention_masks=True means we would be doing attention of padding tokens, which would affect negatively the results."
assert pad_on_left is False, "pad_on_left=True not supported yet, see TODOs below"
Member:

We avoid asserts in the code, as they are stripped when the code is launched with the -O flag for a more optimized run.
I would use logger.warning here and reassign to the expected values; this does not feel like it should break the code.

Member:

I don't think we should let the evaluation continue if we'd get inaccurate results, though; that's a waste of compute, wdyt?

Member:

That's why we should reassign to the correct values and tell the user we did so instead of breaking. Or do you want to just stop anyway and force the user to set these params correctly? (In which case, raise an error instead of using an assert.)
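
A minimal sketch of the two alternatives being discussed (the variable names come from the quoted snippet; the logger setup is an assumption, not the actual patch):

```python
import logging

logger = logging.getLogger(__name__)

# Option 1: warn and fall back to the supported value instead of asserting.
if full_attention_masks:
    logger.warning(
        "full_attention_masks=True would attend over padding tokens; forcing it to False."
    )
    full_attention_masks = False

# Option 2: fail loudly with an explicit error so the check survives `python -O`.
if pad_on_left:
    raise ValueError("pad_on_left=True is not supported yet, see TODOs below.")
```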

else:
    # We need to remove some tokens
    padding_length = padding_length - (padding_length % self.parallel_config.tp)
assert (
Member:

same

Comment on lines +511 to +529
# if max_context % self.parallel_config.tp != 0:
# # We need to round up to the next multiple of self.parallel_config.tp
# if (max_context + (self.parallel_config.tp - max_context % self.parallel_config.tp)) < self.max_length:
# # We can add some tokens
# max_context = max_context + (self.parallel_config.tp - max_context % self.parallel_config.tp)
# else:
# # We need to remove some tokens
# max_context = max_context - (max_context % self.parallel_config.tp)

# if padding_length % self.parallel_config.tp != 0:
# # We need to round up to the next multiple of self.parallel_config.tp
# if (
# padding_length + (self.parallel_config.tp - padding_length % self.parallel_config.tp)
# ) < self.max_length:
# # We can add some tokens
# padding_length = padding_length + (self.parallel_config.tp - padding_length % self.parallel_config.tp)
# else:
# # We need to remove some tokens
# padding_length = padding_length - (padding_length % self.parallel_config.tp)
Member:

rm

Collaborator Author:

cc @NouamaneTazi, do we want to keep this?

@hynky1999 (Collaborator, Author)

hynky1999 commented May 21, 2025

> Do we want to pin nanotron to a version above the one we're currently using?

Ideally, but there is no release to pin this to :/ The last nanotron release is from March last year. cc @NouamaneTazi, could you release a new version once all the changes are settled so that we can pin?

> If you had the time to add a nanotron model to the test suite, that would be fire, but if not it can be another PR

Sorry, I don't have time for that.

@clefourrier (Member)

fair enough re the test suite

@NathanHB NathanHB merged commit 034c23b into main May 22, 2025
5 checks passed
@NathanHB NathanHB added the feature/enhancement New feature/request label May 22, 2025