New options for preference tuning: rpo alpha, logprobs normalization, reference-free, simpo gamma #327
Conversation
@@ -204,7 +215,24 @@ def create_finetune_request(
    if training_method == "sft":
        training_method_cls = TrainingMethodSFT(train_on_inputs=train_on_inputs)
    elif training_method == "dpo":
        training_method_cls = TrainingMethodDPO(dpo_beta=dpo_beta)
        if simpo_gamma is not None and simpo_gamma > 0:
By the way, should we raise a ValueError if it's <=0?
Added, and also added the same check for rpo_alpha (I can't imagine a use case for negative values for these parameters)
    if rpo_alpha is not None:
        if training_method != "dpo":
            raise ValueError("rpo_alpha is only supported for DPO training")
        if not rpo_alpha >= 0.0:
Maybe it's wise to put an upper limit too
Not sure what the limit should be here; let's say 10? Wdyt?
I'm not sure we should be enforcing any particular limit on this value, although it might be helpful. The problem is that this limit would apply only when users submit jobs via together-python.
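To make the thread concrete, here is a minimal sketch of the bounds checks being discussed; the helper name, error messages, and the decision not to enforce an upper limit on rpo_alpha are illustrative assumptions, not the merged implementation.

```python
from typing import Optional

# Illustrative sketch of the validation discussed above; the function name and
# error messages are assumptions, not the code that was merged.
def validate_preference_params(
    training_method: str,
    rpo_alpha: Optional[float] = None,
    simpo_gamma: Optional[float] = None,
) -> None:
    if rpo_alpha is not None:
        if training_method != "dpo":
            raise ValueError("rpo_alpha is only supported for DPO training")
        if rpo_alpha < 0.0:
            # No upper bound enforced here; see the discussion above.
            raise ValueError("rpo_alpha must be non-negative")
    if simpo_gamma is not None:
        if training_method != "dpo":
            raise ValueError("simpo_gamma is only supported for DPO training")
        if simpo_gamma <= 0.0:
            raise ValueError("simpo_gamma must be positive")
```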
        raise ValueError(
            "dpo_normalize_logratios_by_length=True is only supported for DPO training"
        )
    if rpo_alpha is not None:
this could simply be `if rpo_alpha`
Not quite, PEP 8 explicitly advises against it: truthiness checks conflate None with falsy values like 0.0, so comparisons to None should use `is not None`.
Also, a bit below I want to notify the user that rpo_alpha == 0.0 throws an error, which a plain truthiness check would silently skip.
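A tiny runnable example of the behaviour this thread is about: with a truthiness check, rpo_alpha == 0.0 is silently skipped, while the explicit None check still lets the validation below reject it.

```python
rpo_alpha = 0.0

if rpo_alpha:
    print("truthiness check: branch taken")     # not printed, 0.0 is falsy
if rpo_alpha is not None:
    print("explicit None check: branch taken")  # printed, so 0.0 can still be validated
```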
src/together/resources/finetune.py (outdated)
@@ -765,6 +812,9 @@ async def create(
        training_method (str, optional): Training method. Defaults to "sft".
            Supported methods: "sft", "dpo".
        dpo_beta (float, optional): DPO beta parameter. Defaults to None.
        dpo_normalize_logratios_by_length (bool): Whether or not normalize logratios by sample lenght. Defaults to False,
length* (sorry for being nit-picky)
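For readers following the conversation, here is a hedged sketch of how the new options added in this PR might be passed from client code. The parameter names come from the diff above, but the model name, file ID, and values are placeholders, and the exact released signature of client.fine_tuning.create may differ.

```python
from together import Together

client = Together()  # assumes TOGETHER_API_KEY is set in the environment

# Placeholder model/file values; parameter names follow this PR's diff and the
# released signature may differ.
job = client.fine_tuning.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    training_file="file-abc123",
    training_method="dpo",
    dpo_beta=0.1,
    rpo_alpha=0.5,                            # must be >= 0 per the checks above
    dpo_normalize_logratios_by_length=True,   # normalize logratios by sample length
    simpo_gamma=0.3,                          # > 0 enables the SimPO-style objective
)
print(job.id)
```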
Have you read the Contributing Guidelines?
Issue #
Describe your changes
Clearly and concisely describe what's in this pull request. Include screenshots, if necessary.