Nanotron, Multilingual tasks update + misc #756
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull Request Overview
This PR improves multilingual task support and refines Nanotron-based model evaluation by updating prompts, configuration, and task adapters while introducing new QA and MCQ datasets and cleaning up deprecated code.
- Updated QA template to deduplicate choices (see the sketch after this list)
- Improved transformer and Nanotron model handling (e.g. position ids, special tokens, batch sizing)
- Added new multilingual tasks (GermanQuAD, SQuAD-it, FaQuAD, SQuAD-es, OpenBookQA-es, OAB Exams, ENEM) and adjusted configuration loading
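For flavor, a minimal sketch of what such a deduplication can look like (hypothetical helper, not the PR's exact code; the PR reportedly uses set(), while dict.fromkeys below keeps the original choice order):

```python
def dedup_choices(choices: list[str], gold_index: int) -> tuple[list[str], int]:
    """Drop duplicate choices and remap the gold index to the kept copy."""
    gold_text = choices[gold_index]
    # dict.fromkeys preserves first-seen order, unlike a plain set()
    unique_choices = list(dict.fromkeys(choices))
    return unique_choices, unique_choices.index(gold_text)


choices, gold = dedup_choices(["Paris", "Rome", "Paris", "Berlin"], gold_index=2)
assert choices == ["Paris", "Rome", "Berlin"] and gold == 0
```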
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/lighteval/tasks/templates/qa.py | Deduplicates QA choices using set(), updating gold index and choices |
| src/lighteval/tasks/templates/multichoice.py | Ensures answers are cast to string for consistent formatting |
| src/lighteval/tasks/multilingual/tasks.py | Introduces new multilingual task configurations with conditional hf_subset logic |
| src/lighteval/tasks/multilingual/adapters.py | Adds an adapter for ENEM tasks |
| src/lighteval/tasks/default_prompts.py | Removes unused utility function |
| src/lighteval/pipeline.py | Adjusts import paths for updated Nanotron model location |
| src/lighteval/models/transformers/transformers_model.py | Refactors token length handling in loglikelihood computation |
| src/lighteval/models/nanotron/nanotron_model.py | Refactors configuration access, changes default special tokens behavior, and adds position_ids support |
| src/lighteval/metrics/metrics_sample.py | Adds an extra parameter to compute function for improved log-prob normalization |
| src/lighteval/main_nanotron.py | Updates config loading to use YAML SafeLoader and expanded FullNanotronConfig |
| src/lighteval/config/lighteval_config.py | Incorporates updated FullNanotronConfig supporting new Nanotron configuration components |
Comments suppressed due to low confidence (2)
src/lighteval/metrics/metrics_sample.py:346
- The new parameter `reference_texts` should be documented to clarify its usage and expected format in the compute function's docstring.

```python
def compute(self, logprobs: list[float], target_tokens: list[list[int]], reference_texts: list[str], **kwargs) -> float:
```
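One shape the requested docstring could take (a sketch; the wording of the normalization note is an assumption, not the PR's text):

```python
def compute(self, logprobs: list[float], target_tokens: list[list[int]], reference_texts: list[str], **kwargs) -> float:
    """Compute the probability metric for a single sample.

    Args:
        logprobs: Log-probabilities of each candidate continuation.
        target_tokens: Token ids of each candidate continuation.
        reference_texts: Raw text of each candidate, e.g. for length- or
            character-based normalization of the log-probs (assumed usage).

    Returns:
        The (possibly normalized) probability of the gold continuation.
    """
    ...
```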
src/lighteval/models/nanotron/nanotron_model.py:99
- Changing the default value of `add_special_tokens` from `True` to `False` could affect tokenization outputs if the model was originally trained with special tokens. Verify that downstream processing and model performance remain correct with this new default.

```python
add_special_tokens: Optional[bool] = False,
```
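To make the flagged risk concrete, a quick illustration with a generic tokenizer (the model name is only an example, not from this PR):

```python
from transformers import AutoTokenizer

# bert-base-uncased is just an example of a tokenizer with special tokens.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
with_special = tok("hello", add_special_tokens=True)["input_ids"]      # [CLS] hello [SEP]
without_special = tok("hello", add_special_tokens=False)["input_ids"]  # hello only
# For a model trained with BOS/EOS (or CLS/SEP), the two sequences differ,
# which shifts every downstream log-probability.
print(with_special, without_special)
```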
Pull Request Overview
This PR integrates Nanotron support, enhances metrics normalization and testing, expands multilingual task coverage with new QA/MCQ benchmarks, and applies miscellaneous code improvements.
- Nanotron: switch to YAML config parsing, adjust model/tokenizer args, remove unused env config, refine batch and tokenization logic
- Metrics & Tests: accept `reference_texts` in probability metrics and update unit tests
- Multilingual Tasks: import `enem_adapter`, add GermanQuAD, SQuAD-it, FaQuAD, OpenBook QA ES, OAB Exams, ENEM tasks, and handle Japanese subset tags
- Misc: deduplicate QA choices, cast choices to `str` (see the sketch after this list), remove unused utilities, refactor prompt templates and dataclasses
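A tiny sketch of the str-cast item above (illustrative only, not the template's exact code):

```python
# Dataset answer fields are sometimes ints or floats, which breaks
# string formatting of the prompt; casting up front avoids that.
raw_choices = [1, 2.5, "three"]
choices = [str(c) for c in raw_choices]
prompt = "\n".join(f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices))
print(prompt)
# A. 1
# B. 2.5
# C. three
```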
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_unit_base_metrics.py | Include reference_texts argument in metric tests |
| src/lighteval/tasks/templates/qa.py | Deduplicate MCQ choices using set |
| src/lighteval/tasks/templates/multichoice.py | Cast each choice to str to avoid non-string formatting issues |
| src/lighteval/tasks/multilingual/tasks.py | Import enem_adapter, add several multilingual QA/MCQ task configs, adjust subset tags |
| src/lighteval/tasks/multilingual/adapters.py | Add enem_adapter for ENEM dataset |
| src/lighteval/tasks/default_prompts.py | Remove unused get_drop_date utility |
| src/lighteval/pipeline.py | Update Nanotron import path and parameter renaming |
| src/lighteval/models/transformers/transformers_model.py | Refactor log-likelihood input length handling and pad/gather logic |
| src/lighteval/models/nanotron/nanotron_model.py | Rename config attributes, remove override_bs parameters, assert batch settings |
| src/lighteval/metrics/metrics_sample.py | Extend probability metric compute to accept reference_texts |
| src/lighteval/main_nanotron.py | Change Nanotron config loading to YAML, split into model/tokenizer/general configs |
| src/lighteval/config/lighteval_config.py | Switch to Pydantic for generation args, update FullNanotronConfig fields |
Comments suppressed due to low confidence (2)
tests/test_unit_base_metrics.py:196
- The new `reference_texts` parameter is only tested with its default `None` value. Add a test case where `reference_texts` is non-`None` to ensure the metric branches that use it are covered.

```python
result = prob_metric.sample_level_fn(logprobs=np.log([0.7]), target_tokens=None, reference_texts=None)
```
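A hypothetical follow-up test along the lines Copilot suggests (`prob_metric` and the argument names mirror the snippet above; the expected return type is an assumption):

```python
import numpy as np

# Exercise the non-None branch of reference_texts.
result = prob_metric.sample_level_fn(
    logprobs=np.log([0.7]),
    target_tokens=None,
    reference_texts=["some reference answer"],
)
assert isinstance(result, float)
```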
src/lighteval/tasks/multilingual/tasks.py:3009
- [nitpick] The Japanese subset tag uses `'jap'` here, but elsewhere (e.g., xwinograd) it uses `'jp'`. Unify the tag naming for consistency across tasks.

```python
hf_subset=f"X-CSQA-{standardize_tag(language.value) if language != Language.JAPANESE else 'jap'}",
```
A couple nits (some dead code to rm, some comments to add, and change your asserts) but overall lgtm. Great job!
- Do we want to pin nanotron to a version above the one we're currently using?
- If you have the time to add a nanotron model to the test suite that would be fire, but if not it can be another PR
```diff
@@ -343,7 +337,12 @@ def tok_decode(self, tokens: torch.LongTensor) -> List[str]:
         return self.tokenizer.batch_decode(tokens, skip_special_tokens=True)

     def _model_call(self, inputs: torch.Tensor) -> torch.Tensor:
-        return self.model(inputs)
+        position_ids = (
```
Add a comment to explain why it's needed (I'm a bit curious about the int32 dtype?)
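For reference, a hypothetical sketch of the `_model_call` change under discussion (the real diff is truncated above; the exact call signature and the reason for int32 are assumptions):

```python
import torch

def _model_call(self, inputs: torch.Tensor) -> torch.Tensor:
    # Build explicit position ids of shape (batch_size, seq_len).
    # int32 is assumed to match what the Nanotron kernels expect,
    # rather than torch.arange's default int64.
    position_ids = (
        torch.arange(inputs.shape[1], device=inputs.device, dtype=torch.int32)
        .unsqueeze(0)
        .repeat(inputs.shape[0], 1)
    )
    return self.model(inputs, position_ids=position_ids)
```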
"""Tokenize the context and continuation and compute the log likelihood of those | ||
tokenized sequences. | ||
""" | ||
# requests = requests[:10256] |
rm
```python
        assert (
            full_attention_masks is False
        ), "full_attention_masks=True means we would be doing attention of padding tokens, which would affect negatively the results."
        assert pad_on_left is False, "pad_on_left=True not supported yet, see TODOs below"
```
We avoid asserts in the code since they can be stripped when the code is launched with the -O flag to get something more optimized.
I would use logger.warning here and reassign to the expected values; it doesn't feel like this should break the code.
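A sketch of the suggested pattern (variable names taken from the quoted snippet; initial values are just for illustration):

```python
import logging

logger = logging.getLogger(__name__)

full_attention_masks = True  # example value, would come from the caller
pad_on_left = True           # example value, would come from the caller

# Warn and reassign instead of asserting, so the run keeps going.
if full_attention_masks:
    logger.warning(
        "full_attention_masks=True would attend over padding tokens and skew "
        "the results; forcing it to False."
    )
    full_attention_masks = False

if pad_on_left:
    logger.warning("pad_on_left=True is not supported yet (see TODOs); forcing it to False.")
    pad_on_left = False
```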
I don't think we should let the evaluation continue if we'd get inaccurate results, though; that's a waste of compute. wdyt?
that's why we should reassign to the correct values and tell the users we did so instead of breaking - or do you want to just stop anyway and force the user to set these params correctly? (in which case raise an error instead of using an assert)
```python
        else:
            # We need to remove some tokens
            padding_length = padding_length - (padding_length % self.parallel_config.tp)
        assert (
```
same
```python
        # if max_context % self.parallel_config.tp != 0:
        #     # We need to round up to the next multiple of self.parallel_config.tp
        #     if (max_context + (self.parallel_config.tp - max_context % self.parallel_config.tp)) < self.max_length:
        #         # We can add some tokens
        #         max_context = max_context + (self.parallel_config.tp - max_context % self.parallel_config.tp)
        #     else:
        #         # We need to remove some tokens
        #         max_context = max_context - (max_context % self.parallel_config.tp)

        # if padding_length % self.parallel_config.tp != 0:
        #     # We need to round up to the next multiple of self.parallel_config.tp
        #     if (
        #         padding_length + (self.parallel_config.tp - padding_length % self.parallel_config.tp)
        #     ) < self.max_length:
        #         # We can add some tokens
        #         padding_length = padding_length + (self.parallel_config.tp - padding_length % self.parallel_config.tp)
        #     else:
        #         # We need to remove some tokens
        #         padding_length = padding_length - (padding_length % self.parallel_config.tp)
```
rm
cc @NouamaneTazi, do we want to keep this?
Ideally, but there is no release to pin this to :/ The last nanotron release is from March last year. cc @NouamaneTazi if you can release a new version after all changes are settled so that we can pin
Sorry I don't have time for that |
fair enough re the test suite |
Nanotron
Metrics
Multilingual tasks
Misc