
Nanotron, Multilingual tasks update + misc #756


Merged: 15 commits into main on May 22, 2025

Conversation

hynky1999 (Collaborator)

Nanotron

Metrics

  • Probability metrics now work with char normalization (a sketch follows below)
  • Token normalization is fixed when used with transformers
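
A minimal sketch of what char-normalized probability scoring can look like (illustrative only; the function name and exact formula are assumptions, not the lighteval implementation):

```python
import numpy as np


def char_normalized_probability(logprobs: list[float], reference_texts: list[str]) -> float:
    # Divide each choice's summed log-probability by the character length of its
    # reference text, then return the best-scoring choice as a probability.
    normalized = [lp / len(text) for lp, text in zip(logprobs, reference_texts)]
    return float(np.exp(max(normalized)))
```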

Multilingual tasks

  • New SQuAD-style QA tasks + a few new MCF benchmarks + small fixes to existing tasks

Misc

  • The QA template now only uses unique golds, which speeds up the probability calculation (a sketch follows below)
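
A rough sketch of the deduplication described above (assumed names, not the actual template code):

```python
def deduplicate_choices(choices: list[str], gold_indices: list[int]) -> tuple[list[str], list[int]]:
    # Keep unique choices in their original order and remap the gold indices so
    # they point into the deduplicated list; duplicate golds collapse into one.
    unique_choices = list(dict.fromkeys(choices))
    new_golds = sorted({unique_choices.index(choices[i]) for i in gold_indices})
    return unique_choices, new_golds
```

For example, with choices ["Paris", "Paris", "Rome"] and gold indices [0, 1], this yields (["Paris", "Rome"], [0]), so the gold's probability is computed once instead of twice.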

@hynky1999 hynky1999 changed the title from "Nanotron tf update" to "Nanotron, Multilingual tasks update + misc" on May 20, 2025
@HuggingFaceDocBuilderDev (Collaborator)

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@hynky1999 hynky1999 requested a review from Copilot May 20, 2025 17:39
@Copilot Copilot AI (Contributor) left a comment


Pull Request Overview

This PR improves multilingual task support and refines Nanotron-based model evaluation by updating prompts, configuration, and task adapters while introducing new QA and MCQ datasets and cleaning up deprecated code.

  • Updated QA template to deduplicate choices
  • Improved transformer and Nanotron model handling (e.g. position ids, special tokens, batch sizing)
  • Added new multilingual tasks (GermanQuAD, SQuAD-it, FaQuAD, SQuAD-es, OpenBookQA-es, OAB Exams, ENEM) and adjusted configuration loading

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Summary per file:

  • src/lighteval/tasks/templates/qa.py: Deduplicates QA choices using set(), updating gold index and choices
  • src/lighteval/tasks/templates/multichoice.py: Ensures answers are cast to string for consistent formatting
  • src/lighteval/tasks/multilingual/tasks.py: Introduces new multilingual task configurations with conditional hf_subset logic
  • src/lighteval/tasks/multilingual/adapters.py: Adds an adapter for ENEM tasks
  • src/lighteval/tasks/default_prompts.py: Removes unused utility function
  • src/lighteval/pipeline.py: Adjusts import paths for updated Nanotron model location
  • src/lighteval/models/transformers/transformers_model.py: Refactors token length handling in loglikelihood computation
  • src/lighteval/models/nanotron/nanotron_model.py: Refactors configuration access, changes default special tokens behavior, and adds position_ids support
  • src/lighteval/metrics/metrics_sample.py: Adds an extra parameter to the compute function for improved log-prob normalization
  • src/lighteval/main_nanotron.py: Updates config loading to use YAML SafeLoader and the expanded FullNanotronConfig
  • src/lighteval/config/lighteval_config.py: Incorporates the updated FullNanotronConfig supporting new Nanotron configuration components
Comments suppressed due to low confidence (2)

src/lighteval/metrics/metrics_sample.py:346

  • The new parameter 'reference_texts' should be documented to clarify its usage and expected format in the compute function's docstring.
def compute(self, logprobs: list[float], target_tokens: list[list[int]], reference_texts: list[str], **kwargs) -> float:
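
For illustration, a docstring along these lines would address the comment (a sketch only; the wording and normalization details are assumptions, not the actual lighteval code):

```python
def compute(self, logprobs: list[float], target_tokens: list[list[int]], reference_texts: list[str], **kwargs) -> float:
    """Compute the probability of the gold continuation(s).

    Args:
        logprobs: Summed log-probability of each choice.
        target_tokens: Tokenized choices, used when normalizing by token length.
        reference_texts: Raw choice strings, used when normalizing by character length.

    Returns:
        The (optionally length-normalized) probability as a float.
    """
    ...
```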

src/lighteval/models/nanotron/nanotron_model.py:99

  • Changing the default value of add_special_tokens from True to False could affect tokenization outputs if the model was originally trained with special tokens. Verify that downstream processing and model performance remain correct with this new default.
add_special_tokens: Optional[bool] = False,

@hynky1999 hynky1999 requested a review from Copilot May 20, 2025 17:42
@Copilot Copilot AI (Contributor) left a comment


Pull Request Overview

This PR integrates Nanotron support, enhances metrics normalization and testing, expands multilingual task coverage with new QA/MCQ benchmarks, and applies miscellaneous code improvements.

  • Nanotron: switch to YAML config parsing, adjust model/tokenizer args, remove unused env config, refine batch and tokenization logic
  • Metrics & Tests: accept reference_texts in probability metrics and update unit tests
  • Multilingual Tasks: import enem_adapter, add GermanQuAD, SQuAD-it, FaQuAD, OpenBook QA ES, OAB Exams, ENEM tasks, and handle Japanese subset tags
  • Misc: deduplicate QA choices, cast choices to str, remove unused utilities, refactor prompt templates and dataclasses

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Summary per file:

  • tests/test_unit_base_metrics.py: Include the reference_texts argument in metric tests
  • src/lighteval/tasks/templates/qa.py: Deduplicate MCQ choices using set
  • src/lighteval/tasks/templates/multichoice.py: Cast each choice to str to avoid non-string formatting issues
  • src/lighteval/tasks/multilingual/tasks.py: Import enem_adapter, add several multilingual QA/MCQ task configs, adjust subset tags
  • src/lighteval/tasks/multilingual/adapters.py: Add enem_adapter for the ENEM dataset
  • src/lighteval/tasks/default_prompts.py: Remove the unused get_drop_date utility
  • src/lighteval/pipeline.py: Update the Nanotron import path and parameter renaming
  • src/lighteval/models/transformers/transformers_model.py: Refactor log-likelihood input length handling and pad/gather logic
  • src/lighteval/models/nanotron/nanotron_model.py: Rename config attributes, remove override_bs parameters, assert batch settings
  • src/lighteval/metrics/metrics_sample.py: Extend probability metric compute to accept reference_texts
  • src/lighteval/main_nanotron.py: Change Nanotron config loading to YAML, split into model/tokenizer/general configs
  • src/lighteval/config/lighteval_config.py: Switch to Pydantic for generation args, update FullNanotronConfig fields
Comments suppressed due to low confidence (2)

tests/test_unit_base_metrics.py:196

  • The new reference_texts parameter is only tested with its default None value. Add a test case where reference_texts is non-None to ensure the metric branches that use it are covered.
result = prob_metric.sample_level_fn(logprobs=np.log([0.7]), target_tokens=None, reference_texts=None)
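
A sketch of the kind of follow-up test Copilot is asking for (the metric variable, the availability of np and pytest in the test module, and the char-normalized expected value are all assumptions, not the actual test suite):

```python
# Hypothetical addition: exercise the char-normalization branch with real strings.
result = prob_metric_norm.sample_level_fn(
    logprobs=np.log([0.7]),
    target_tokens=None,
    reference_texts=["hello"],
)
assert result == pytest.approx(0.7 ** (1 / len("hello")))
```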

src/lighteval/tasks/multilingual/tasks.py:3009

  • [nitpick] The Japanese subset tag uses 'jap' here, but elsewhere (e.g., xwinograd) it uses 'jp'. Unify the tag naming for consistency across tasks.
hf_subset=f"X-CSQA-{standardize_tag(language.value) if language != Language.JAPANESE else 'jap'}",

@hynky1999 hynky1999 requested review from anton-l and NathanHB May 20, 2025 17:54
@clefourrier clefourrier (Member) left a comment

A couple of nits (some dead code to rm, some comments to add, and change your asserts), but overall LGTM. Great job!

  • Do we want to pin nanotron to a version above the one we're currently using?
  • If you had the time to add a nanotron model to the test suite, that would be fire, but if not it can be another PR

@@ -343,7 +337,12 @@ def tok_decode(self, tokens: torch.LongTensor) -> List[str]:
return self.tokenizer.batch_decode(tokens, skip_special_tokens=True)

def _model_call(self, inputs: torch.Tensor) -> torch.Tensor:
return self.model(inputs)
position_ids = (
Member:

Add a comment to explain why it's needed (I'm a bit curious about the int32 dtype?)
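
For context, a common way per-token position ids get built for a batch (a generic sketch, not necessarily what this diff does; the int32 dtype here simply mirrors the question above and would presumably match what the Nanotron forward pass expects):

```python
import torch


def build_position_ids(inputs: torch.Tensor) -> torch.Tensor:
    # One position index per token in each row of the batch, dtype int32.
    batch_size, seq_len = inputs.shape
    return (
        torch.arange(seq_len, dtype=torch.int32, device=inputs.device)
        .unsqueeze(0)
        .expand(batch_size, seq_len)
    )
```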


"""Tokenize the context and continuation and compute the log likelihood of those
tokenized sequences.
"""
# requests = requests[:10256]
Member:

rm

Comment on lines +489 to +492
assert (
    full_attention_masks is False
), "full_attention_masks=True means we would be doing attention of padding tokens, which would affect negatively the results."
assert pad_on_left is False, "pad_on_left=True not supported yet, see TODOs below"
Member:

We avoid asserts in the code, as they are stripped when the code is launched with the -O flag for a more optimized run.
I would use logger.warning here and reassign to the expected values; this does not feel like it should break the code.

Member:

I don't think we should let the evaluation continue if we'd get inaccurate results, though; that's a waste of compute, wdyt?

Member:

That's why we should reassign to the correct values and tell the user we did so instead of breaking. Or do you want to just stop anyway and force the user to set these params correctly? (In which case, raise an error instead of using an assert.)
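
A minimal sketch of the two alternatives being discussed (the variable names come from the quoted snippet; the logger setup is an assumption, not the actual patch):

```python
import logging

logger = logging.getLogger(__name__)

# Option 1: warn and fall back to the supported value instead of asserting.
if full_attention_masks:
    logger.warning(
        "full_attention_masks=True would attend over padding tokens; forcing it to False."
    )
    full_attention_masks = False

# Option 2: fail loudly with an explicit error so the check survives `python -O`.
if pad_on_left:
    raise ValueError("pad_on_left=True is not supported yet, see TODOs below.")
```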

else:
    # We need to remove some tokens
    padding_length = padding_length - (padding_length % self.parallel_config.tp)
assert (
Member:

same

Comment on lines +511 to +529
# if max_context % self.parallel_config.tp != 0:
# # We need to round up to the next multiple of self.parallel_config.tp
# if (max_context + (self.parallel_config.tp - max_context % self.parallel_config.tp)) < self.max_length:
# # We can add some tokens
# max_context = max_context + (self.parallel_config.tp - max_context % self.parallel_config.tp)
# else:
# # We need to remove some tokens
# max_context = max_context - (max_context % self.parallel_config.tp)

# if padding_length % self.parallel_config.tp != 0:
# # We need to round up to the next multiple of self.parallel_config.tp
# if (
# padding_length + (self.parallel_config.tp - padding_length % self.parallel_config.tp)
# ) < self.max_length:
# # We can add some tokens
# padding_length = padding_length + (self.parallel_config.tp - padding_length % self.parallel_config.tp)
# else:
# # We need to remove some tokens
# padding_length = padding_length - (padding_length % self.parallel_config.tp)
Member:

rm

Collaborator Author:

cc @NouamaneTazi, do we want to keep this?

@hynky1999 (Collaborator, Author)

hynky1999 commented May 21, 2025

> Do we want to pin nanotron to a version above the one we're currently using?

Ideally, but there is no release to pin this to :/ The last nanotron release is from March last year. cc @NouamaneTazi, could you release a new version once all the changes are settled so that we can pin?

> If you had the time to add a nanotron model to the test suite, that would be fire, but if not it can be another PR

Sorry, I don't have time for that.

@clefourrier (Member)

fair enough re the test suite

@NathanHB NathanHB merged commit 034c23b into main May 22, 2025
5 checks passed
@NathanHB NathanHB added the feature/enhancement New feature/request label May 22, 2025