Daniele Rege Cambrin1 · Giuseppe Gallipoli1 · Irene Benedetto1 · Luca Cagliero1 · Paolo Garza1
1Politecnico di Torino, Italy
This study investigates the use of established semantic segmentation loss functions in natural language generation to create a versatile, practical, and scalable solution for fine-tuning different architectures. We evaluate their effectiveness in solving Math Word Problems and question answering across different models of varying sizes. For the analyzed tasks, we found that the traditional Cross-Entropy loss represents a sub-optimal choice, while models trained to minimize alternative (task-dependent) losses, such as Focal or Lovász, achieve a mean improvement of +42% on exact match without requiring additional data or human feedback. These findings suggest a promising pathway for more efficient and accessible training processes.
REPOSITORY UNDER CONSTRUCTION: SOME FILES MAY BE MISSING
Install the dependencies listed in `requirements.txt`. Make sure to edit the config files in the `configs/` folder, then run `improved_loss.py`.

With `baseline_inference.py`, you can run the baseline models used for comparison.
WORKING ON SIMPLIFYING THE TRAINING
For efficiency, the datasets were tokenized and stored as Parquet files. The following table describes the schema used for the Parquet files:
Field | Type | Description |
---|---|---|
`input_ids` | List[int] | Token IDs generated by the tokenizer. |
`attention_mask` | List[int] | Attention mask generated by the tokenizer. |
`labels_position_id` | List[int] | Indicates the position where the answer starts. Contains a single element. |
- The tokenizer creates the `input_ids` and `attention_mask`.
- The `labels_position_id` indicates the starting position of the answer, which is necessary for applying the specific loss.

For more details, refer to the script `parquet_creator.py`, which can help create these Parquet files (a simplified sketch follows).
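As a rough illustration only, the sketch below shows how a Parquet file matching this schema could be built. The tokenizer checkpoint, the example prompt/answer, and the output path are placeholders, and the actual preprocessing logic lives in `parquet_creator.py`.

```python
# Simplified sketch (not the actual parquet_creator.py logic): tokenize a
# prompt/answer pair and store it with the schema described above.
import pandas as pd
from transformers import AutoTokenizer

# Placeholder checkpoint; any tokenizer matching the fine-tuned model works.
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Base-3B-v1")

def encode_example(prompt: str, answer: str) -> dict:
    # Tokenize the prompt alone to estimate where the answer tokens begin.
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    full = tokenizer(prompt + answer, add_special_tokens=False)
    return {
        "input_ids": full["input_ids"],
        "attention_mask": full["attention_mask"],
        # Single-element list: index of the first answer token.
        "labels_position_id": [len(prompt_ids)],
    }

rows = [encode_example("Question: 2 + 2 = ", "4")]
pd.DataFrame(rows).to_parquet("train.parquet", index=False)  # requires pyarrow
```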
This section summarizes all models, datasets, and losses we employed during training.
This is the list of datasets tested in the paper. The number of samples is approximate.
Dataset | Samples | Link |
---|---|---|
GSM8K | 8.5K | https://huggingface.co/datasets/gsm8k |
MathQA | 38K | https://huggingface.co/datasets/math_qa |
HellaSwag | 50K | https://huggingface.co/datasets/Rowan/hellaswag |
OpenBookQA | 6K | https://huggingface.co/datasets/openbookqa |
This is the list of base models used for fine-tuning. Models marked as Well-Defined in the Pre-Training Dataset column were pre-trained only on a known list of datasets (generally documented in their technical reports). This was done to avoid any overlap with the fine-tuning data.
Model | Size | License | Pre-Training Dataset | Link |
---|---|---|---|---|
RedPajama-Incite | 3B | Apache 2.0 | Well-Defined | link |
StableLM | 3B | CC BY-SA-4.0 | Well-Defined | link |
RedPajama-Incite | 7B | Apache 2.0 | Well-Defined | link |
Falcon | 7B | Apache 2.0 | Well-Defined (90%) | link |
Llama-2 | 7B | Llama-2 | Public | link |
These are the losses analyzed in the paper, together with links to the original papers (read them for a better understanding of how they work). The code for the losses is in the `loss` folder of this repository (with the corresponding licenses in the `loss_licenses` folder). The Type taxonomy follows the one proposed by Jun Ma. A minimal usage sketch is shown after the table.
Loss | Type | Link |
---|---|---|
Cross-Entropy Loss | Distribution | - |
Focal Loss | Distribution | link |
Generalized Dice Loss | Region | link |
Self-Adjusting Dice Loss | Combo | link |
Lovász Loss | Region | link |
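To illustrate how a segmentation-style loss can stand in for Cross-Entropy during fine-tuning, below is a minimal sketch of Focal Loss (Lin et al.) applied to language-model logits, with `labels_position_id` used to exclude prompt tokens from the loss. This is not the repository's implementation (see the `loss` folder for that); the `gamma` value, the `-100` ignore convention, and the masking details are assumptions.

```python
# Hedged sketch of Focal Loss for next-token prediction; the actual
# implementations used in the paper are in the loss/ folder.
import torch
import torch.nn.functional as F

def focal_lm_loss(logits, input_ids, labels_position_id, gamma=2.0):
    # logits: (batch, seq_len, vocab) from the language model.
    # Standard causal shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask out prompt tokens so only the answer contributes to the loss.
    for i, start in enumerate(labels_position_id):
        shift_labels[i, : max(int(start) - 1, 0)] = -100

    ce = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
        reduction="none",
    )
    pt = torch.exp(-ce)                       # probability assigned to the true token
    keep = shift_labels.reshape(-1) != -100
    focal = ((1.0 - pt) ** gamma * ce)[keep]  # down-weight easy (high-pt) tokens
    return focal.mean()
```

With `gamma=0` the weighting term becomes 1 and the sketch reduces to standard Cross-Entropy, which makes the relationship between the two losses easy to see.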
This project is licensed under the Apache 2.0 license. See LICENSE for more information.
If you find this project useful, please consider citing:
@inproceedings{Rege_Cambrin_2024,
title={Beyond Accuracy Optimization: Computer Vision Losses for Large Language Model Fine-Tuning},
url={http://dx.doi.org/10.18653/v1/2024.findings-emnlp.704},
DOI={10.18653/v1/2024.findings-emnlp.704},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},
publisher={Association for Computational Linguistics},
author={Rege Cambrin, Daniele and Gallipoli, Giuseppe and Benedetto, Irene and Cagliero, Luca and Garza, Paolo},
year={2024},
pages={12060–12079}
}