Daniele Rege Cambrin1 · Giuseppe Gallipoli1 · Irene Benedetto1 · Luca Cagliero1 · Paolo Garza1
1Politecnico di Torino, Italy
This study investigates the use of established semantic segmentation loss functions in natural language generation to create a versatile, practical, and scalable solution for fine-tuning different architectures. We evaluate their effectiveness in solving Math Word Problems and question answering across different models of varying sizes. For the analyzed tasks, we found that the traditional Cross-Entropy loss represents a sub-optimal choice, while models trained to minimize alternative (task-dependent) losses, such as Focal or Lovász, achieve a mean improvement of +42% on exact match without requiring additional data or human feedback. These findings suggest a promising pathway for more efficient and accessible training processes.
REPOSITORY UNDER CONSTRUCTION: SOME FILES MAY BE MISSING
Install the dependencies listed in `requirements.txt`. Make sure to edit the config files in the `configs/` folder, then run `improved_loss.py`.

With `baseline_inference.py`, you can run the baseline models used for comparison.
WORKING ON SIMPLIFYING THE TRAINING
For efficiency, the datasets were tokenized and stored as Parquet files. The following table describes the schema used for the Parquet files:
Field | Type | Description |
---|---|---|
`input_ids` | List[int] | Token IDs generated by the tokenizer. |
`attention_mask` | List[int] | Attention mask generated by the tokenizer. |
`labels_position_id` | List[int] | Indicates the position where the answer starts. Contains a single element. |
- The tokenizer creates the `input_ids` and `attention_mask`.
- The `labels_position_id` indicates the starting position of the answer, which is necessary for applying the specific loss.

For more details, refer to the script `parquet_creator.py`, which can help create these Parquet files (a simplified sketch follows).
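As a rough illustration only, the sketch below shows how a Parquet file matching this schema could be built. The tokenizer checkpoint, the example prompt/answer, and the output path are placeholders, and the actual preprocessing logic lives in `parquet_creator.py`.

```python
# Simplified sketch (not the actual parquet_creator.py logic): tokenize a
# prompt/answer pair and store it with the schema described above.
import pandas as pd
from transformers import AutoTokenizer

# Placeholder checkpoint; any tokenizer matching the fine-tuned model works.
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Base-3B-v1")

def encode_example(prompt: str, answer: str) -> dict:
    # Tokenize the prompt alone to estimate where the answer tokens begin.
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    full = tokenizer(prompt + answer, add_special_tokens=False)
    return {
        "input_ids": full["input_ids"],
        "attention_mask": full["attention_mask"],
        # Single-element list: index of the first answer token.
        "labels_position_id": [len(prompt_ids)],
    }

rows = [encode_example("Question: 2 + 2 = ", "4")]
pd.DataFrame(rows).to_parquet("train.parquet", index=False)  # requires pyarrow
```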
This section summarizes all models, datasets, and losses we employed during training.
This is the list of datasets tested in the paper. The number of samples is approximate.
Dataset | Samples | Link |
---|---|---|
GSM8K | 8.5K | https://huggingface.co/datasets/gsm8k |
MathQA | 38K | https://huggingface.co/datasets/math_qa |
HellaSwag | 50K | https://huggingface.co/datasets/Rowan/hellaswag |
OpenBookQA | 6K | https://huggingface.co/datasets/openbookqa |
This is the list of base models used for fine-tuning. Models marked as Well-Defined in the Pre-Training Dataset column were pre-trained only on a known list of datasets (generally documented in their technical reports). This was done to avoid any overlap with the fine-tuning data.
Model | Size | License | Pre-Training Dataset | Link |
---|---|---|---|---|
RedPajama-Incite | 3B | Apache 2.0 | Well-Defined | link |
StableLM | 3B | CC BY-SA-4.0 | Well-Defined | link |
RedPajama-Incite | 7B | Apache 2.0 | Well-Defined | link |
Falcon | 7B | Apache 2.0 | Well-Defined (90%) | link |
Llama-2 | 7B | Llama-2 | Public | link |
These are the losses analyzed in the paper, together with links to the original papers (read them for a better understanding of how they work). The code for the losses is in the `loss` folder of this repository (with the corresponding licenses in the `loss_licenses` folder). The Type taxonomy follows the one proposed by Jun Ma. A minimal usage sketch is shown after the table.
Loss | Type | Link |
---|---|---|
Cross-Entropy Loss | Distribution | - |
Focal Loss | Distribution | link |
Generalized Dice Loss | Region | link |
Self-Adjusting Dice Loss | Combo | link |
Lovász Loss | Region | link |
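To illustrate how a segmentation-style loss can stand in for Cross-Entropy during fine-tuning, below is a minimal sketch of Focal Loss (Lin et al.) applied to language-model logits, with `labels_position_id` used to exclude prompt tokens from the loss. This is not the repository's implementation (see the `loss` folder for that); the `gamma` value, the `-100` ignore convention, and the masking details are assumptions.

```python
# Hedged sketch of Focal Loss for next-token prediction; the actual
# implementations used in the paper are in the loss/ folder.
import torch
import torch.nn.functional as F

def focal_lm_loss(logits, input_ids, labels_position_id, gamma=2.0):
    # logits: (batch, seq_len, vocab) from the language model.
    # Standard causal shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask out prompt tokens so only the answer contributes to the loss.
    for i, start in enumerate(labels_position_id):
        shift_labels[i, : max(int(start) - 1, 0)] = -100

    ce = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
        reduction="none",
    )
    pt = torch.exp(-ce)                       # probability assigned to the true token
    keep = shift_labels.reshape(-1) != -100
    focal = ((1.0 - pt) ** gamma * ce)[keep]  # down-weight easy (high-pt) tokens
    return focal.mean()
```

With `gamma=0` the weighting term becomes 1 and the sketch reduces to standard Cross-Entropy, which makes the relationship between the two losses easy to see.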
This project is licensed under the Apache 2.0 license. See LICENSE for more information.
If you find this project useful, please consider citing:
@inproceedings{Rege_Cambrin_2024,
title={Beyond Accuracy Optimization: Computer Vision Losses for Large Language Model Fine-Tuning},
url={http://dx.doi.org/10.18653/v1/2024.findings-emnlp.704},
DOI={10.18653/v1/2024.findings-emnlp.704},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},
publisher={Association for Computational Linguistics},
author={Rege Cambrin, Daniele and Gallipoli, Giuseppe and Benedetto, Irene and Cagliero, Luca and Garza, Paolo},
year={2024},
pages={12060–12079}
}