
Beyond Accuracy Optimization: Computer Vision Losses for Large Language Model Fine-Tuning

Daniele Rege Cambrin · Giuseppe Gallipoli · Irene Benedetto · Luca Cagliero · Paolo Garza

Politecnico di Torino, Italy

EMNLP 2024 Findings


This study investigates the use of established semantic segmentation loss functions in natural language generation to create a versatile, practical, and scalable solution for fine-tuning different architectures. We evaluate their effectiveness in solving Math Word Problems and question answering across different models of varying sizes. For the analyzed tasks, we found that the traditional Cross-Entropy loss represents a sub-optimal choice, while models trained to minimize alternative (task-dependent) losses, such as Focal or Lovász, achieve a mean improvement of +42% on exact match without requiring additional data or human feedback. These findings suggest a promising pathway for more efficient and accessible training processes.

REPOSITORY UNDER CONSTRUCTION: SOME FILES MAY BE MISSING

Getting Started

Install the dependencies listed in requirements.txt. Edit the config files in the configs/ folder as needed, then run improved_loss.py.

With baseline_inference.py, you can run the baseline models used for comparison.

WORKING ON SIMPLIFYING THE TRAINING

Input Data

For efficiency, the datasets were tokenized and stored as Parquet files. The following table shows the schema used:

| Field | Type | Description |
| --- | --- | --- |
| input_ids | List[int] | Token IDs generated by the tokenizer. |
| attention_mask | List[int] | Attention mask generated by the tokenizer. |
| labels_position_id | List[int] | Position where the answer starts. Contains a single element. |

  • The tokenizer creates the input_ids and attention_mask.
  • The labels_position_id indicates the starting position of the answer, which is necessary for applying the specific loss.

For more details, refer to the script parquet_creator.py, which can help create these Parquet files.
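
As an illustration of this schema, here is a minimal sketch of how such a Parquet file could be built with a Hugging Face tokenizer and pandas. It is not the exact logic of parquet_creator.py: the model repository id, the prompt format, and the question/answer strings are assumptions made only for the example.

```python
# Minimal sketch (not the exact logic of parquet_creator.py): tokenize
# question/answer pairs and store them with the schema described above.
# The model id and the example strings are assumptions; requires pyarrow.
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Base-3B-v1")

def build_row(question: str, answer: str) -> dict:
    # Tokenize the prompt alone to find where the answer starts
    # (assumes the prompt tokenization is a prefix of the full tokenization).
    prompt_ids = tokenizer(question, add_special_tokens=False)["input_ids"]
    full = tokenizer(question + " " + answer, add_special_tokens=False)
    return {
        "input_ids": full["input_ids"],
        "attention_mask": full["attention_mask"],
        # Single-element list: index of the first answer token.
        "labels_position_id": [len(prompt_ids)],
    }

rows = [build_row("What is 2 + 2?", "The answer is 4.")]
pd.DataFrame(rows).to_parquet("train.parquet", index=False)
```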

Resources

This section summarizes all models, datasets, and losses we employed during training.

Datasets

This is the list of datasets tested in the paper. The number of samples is approximate.

| Dataset | Samples | Link |
| --- | --- | --- |
| GSM8K | 8.5K | https://huggingface.co/datasets/gsm8k |
| MathQA | 38K | https://huggingface.co/datasets/math_qa |
| HellaSwag | 50K | https://huggingface.co/datasets/Rowan/hellaswag |
| OpenBookQA | 6K | https://huggingface.co/datasets/openbookqa |
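
For reference, the raw datasets can be fetched from the Hugging Face Hub with the datasets library before being tokenized into the schema above; the configuration names ("main") and the field accessed below are only illustrative.

```python
# Example of downloading two of the raw datasets before tokenization.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")             # splits: train / test
openbookqa = load_dataset("openbookqa", "main")   # splits: train / validation / test
print(gsm8k["train"][0]["question"])
```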

Base Models

This is the list of base models used for fine-tuning. A model's Pre-Training Dataset is marked Well-Defined in the table below when the model was pre-trained only on a known list of datasets (generally documented in its report). This check was done to avoid any overlap with the fine-tuning data.

| Model | Size | License | Pre-Training Dataset | Link |
| --- | --- | --- | --- | --- |
| RedPajama-Incite | 3B | Apache 2.0 | Well-Defined | link |
| StableLM | 3B | CC BY-SA-4.0 | Well-Defined | link |
| RedPajama-Incite | 7B | Apache 2.0 | Well-Defined | link |
| Falcon | 7B | Apache 2.0 | Well-Defined (90%) | link |
| Llama-2 | 7B | Llama-2 | Public | link |

Losses

These are the losses analyzed in the paper, with links to the original papers (read them for a better understanding of how each works). The code for the losses is in the loss folder of this repository (with the corresponding licenses in the loss_licenses folder). The Type taxonomy follows the one proposed by Jun Ma.

| Loss | Type | Link |
| --- | --- | --- |
| Cross-Entropy Loss | Distribution | - |
| Focal Loss | Distribution | link |
| Generalized Dice Loss | Region | link |
| Self-Adjusting Dice Loss | Combo | link |
| Lovász Loss | Region | link |
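
To make the idea concrete, below is a minimal sketch of a token-level Focal Loss for causal language modeling, where prompt tokens (everything before labels_position_id) are masked out with an ignore index. This is only an illustration of the technique: it is not the implementation shipped in the loss folder, and the default gamma value and masking convention are assumptions.

```python
# Minimal sketch of a token-level Focal Loss for causal LM fine-tuning.
# Illustrative only: not the implementation shipped in the loss folder.
# gamma=2.0 and the ignore_index masking convention are assumptions.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, labels: torch.Tensor,
               gamma: float = 2.0, ignore_index: int = -100) -> torch.Tensor:
    """logits: (batch, seq, vocab); labels: (batch, seq), with ignore_index
    on prompt/padding tokens (e.g. everything before labels_position_id)."""
    # Shift so that position t predicts token t+1, as in standard causal LM training.
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()

    log_probs = F.log_softmax(logits, dim=-1)
    mask = labels.ne(ignore_index)
    # Clamp ignored labels to a valid index before gathering; they are masked out below.
    target_log_probs = log_probs.gather(-1, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    pt = target_log_probs.exp()

    # The (1 - pt)^gamma factor is what distinguishes Focal Loss from plain Cross-Entropy.
    loss = -((1.0 - pt) ** gamma) * target_log_probs
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```

The (1 - pt)^gamma factor down-weights tokens the model already predicts confidently, so training focuses on the harder tokens rather than treating all of them equally as plain Cross-Entropy does.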

License

This project is licensed under the Apache 2.0 license. See LICENSE for more information.

Citation

If you find this project useful, please consider citing:

@inproceedings{Rege_Cambrin_2024,
   title={Beyond Accuracy Optimization: Computer Vision Losses for Large Language Model Fine-Tuning},
   url={http://dx.doi.org/10.18653/v1/2024.findings-emnlp.704},
   DOI={10.18653/v1/2024.findings-emnlp.704},
   booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},
   publisher={Association for Computational Linguistics},
   author={Rege Cambrin, Daniele and Gallipoli, Giuseppe and Benedetto, Irene and Cagliero, Luca and Garza, Paolo},
   year={2024},
   pages={12060–12079}
}
