
feat: refactor main_ds.py (1/n) Model class #572


Open
wants to merge 1 commit into main

Conversation

cdoern
Contributor

@cdoern cdoern commented May 27, 2025

Introduce a new design for key components of main_ds.py, namely splitting Model initialization, Accelerator initialization, Optimizer initialization, and Checkpoint-saving initialization into separate classes. This commit introduces the Model class.

NOTE: a follow-up to this work will introduce classes/structure for the DataLoader, Sampler, etc. This was left out of this PR given the already large scope of the change.

The Model class wraps the various AutoModel classes we support and aims to be a lightweight wrapper that makes the library easier to use with different model types. setup_optimizer resides within the Model class and returns one of the optimizer types we support.

These classes are one of a few steps needed to "SDK-ify" the training library.

Adding structure to code via classes can be either someone's favorite or least favorite thing, so I figured I'd explain myself before continuing. Here is my rationale:

Classes provide logical structure to code, especially code meant to be a publicly consumable SDK, and allow you to associate related objects and methods with one another.

Being able to group functionality under the Model, Accelerator, and Checkpointer classes inherently reduces code complexity and duplication. Storing values like self.distributed_framework, self.lora_config, etc. on the class, so that they are accessible from its different methods, drastically reduces the number of arguments each method takes and eliminates complex return values. Simpler methods and simpler arguments/return values make the code easier to test.
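To make that shape concrete, here is a rough sketch of the idea (illustrative only: apart from the names mentioned above, the constructor fields and method bodies are assumptions, not code from this PR):

class Model:
    def __init__(
        self,
        model_path: str,
        model_type: ModelTypes,
        distributed_framework: DistributedBackend,
        lora_config=None,
        noise_alpha: float | None = None,
    ):
        # Stored once on the instance, so later methods don't need to
        # take them as parameters or thread them through return values.
        self.model_type = model_type
        self.distributed_framework = distributed_framework
        self.lora_config = lora_config
        self.noise_alpha = noise_alpha
        self.model = self._load_model(model_path)  # hypothetical loader

    def setup_optimizer(self, learning_rate: float):
        # Reads self.distributed_framework / self.lora_config from the
        # instance, keeping the signature small.
        ...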

@mergify mergify bot added the testing and ci-failure labels and removed the ci-failure label on May 27, 2025
class ModelTypes(Enum):
    LIGER = "Liger"
    CAUSALLM = "Causallm"
    DOLOMITE = "Dolomite"
@RobotSail (Member)

We've dropped dolomite, no need to include this.

Contributor

@RobotSail Interesting! What does it mean exactly? If I grep through the code, I still see hits for dolomite, including the mandatory dependency on instructlab-dolomite. Was some decision made to drop it? Should we clean these remnants from the tree then?


mergify bot commented May 28, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Contributor

@booxter booxter left a comment


I haven't reviewed the tests or the Accelerator class in detail, and I need to step off this PR, so I'm posting the questions and concerns I have collected so far.

parser.add_argument(
    "--model-class",
    type=str,
    default=ModelTypes.CAUSALLM.value,
Contributor

nit: you can use choices=[x.value for x in ModelTypes] to avoid listing them below
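A minimal sketch of that suggestion, using the ModelTypes enum from this PR (the surrounding parser setup is abbreviated):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model-class",
    type=str,
    default=ModelTypes.CAUSALLM.value,
    # derive the accepted values from the enum instead of hard-coding them
    choices=[x.value for x in ModelTypes],
)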

@@ -141,6 +141,19 @@ class FSDPOptions(BaseModel):
    sharding_strategy: ShardingStrategies = ShardingStrategies.HYBRID_SHARD


class Optimizers(Enum):
Contributor

(No action required, observation) I think it's more common to name enums in the singular, not the plural. But it's a matter of habit, of course.

# public API
class ModelTypes(Enum):
    LIGER = "Liger"
    CAUSALLM = "Causallm"
Contributor

should we use "correct" case? CausalLM?

    from deepspeed.ops.adam import DeepSpeedCPUAdam
except ImportError:
    DeepSpeedCPUAdam = None
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
Contributor

(No action required) I know it was done in main_ds so you are not introducing anything new here, but consider not running code / issuing warnings when importing the module. An import should not, generally, produce side effects of this sort, especially in a library. Consider warning later when the missing class is actually referred to / used.
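A sketch of the deferred-failure pattern being suggested (the get_cpu_adam helper is illustrative, not from the PR):

try:
    from deepspeed.ops.adam import DeepSpeedCPUAdam
except ImportError:
    DeepSpeedCPUAdam = None  # noted, but no warning printed at import time


def get_cpu_adam(params, **kwargs):
    # Complain only when the optimizer is actually requested.
    if DeepSpeedCPUAdam is None:
        raise RuntimeError(
            "DeepSpeedCPUAdam requires the 'deepspeed' package to be installed."
        )
    return DeepSpeedCPUAdam(params, **kwargs)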

    output_dir: str,
    distributed_framework: DistributedBackend,
    model_type: ModelTypes,
    noise_alpha: Optional[float],
Contributor

nit: use type | None instead of Optional
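Applied to the parameters above (sketched as a constructor for concreteness; the enclosing function isn't shown in the diff):

def __init__(
    self,
    output_dir: str,
    distributed_framework: DistributedBackend,
    model_type: ModelTypes,
    noise_alpha: float | None,  # PEP 604 syntax, Python 3.10+
):
    ...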

)
self.model.config.eos_token_id = self.tokenizer.eos_token_id

if "ForCausalLM" not in self.model.__class__.__name__:
Contributor

this is fragile; can you think of a more robust way of checking it? If not, maybe the Model class could have a helper method to hide the check?
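One possible shape for that helper (a sketch; the underlying check is unchanged, just hidden behind a method so callers don't depend on the class-naming convention):

def is_causal_lm(self) -> bool:
    # transformers names its causal-LM heads "*ForCausalLM", so the
    # class-name check works today; encapsulating it here means callers
    # won't break if a more robust signal is adopted later.
    return "ForCausalLM" in self.model.__class__.__name__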

from .utils import add_noisy_embeddings, convert_loss_to_reduce_sum

self.model = convert_loss_to_reduce_sum(
    self.model, use_dolomite=(self.model_type == "dolomite")
Contributor

incorrect enum == str check: self.model_type is a ModelTypes member, so comparing it against the string "dolomite" is always False
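A sketch of the fix, assuming self.model_type holds a ModelTypes member:

self.model = convert_loss_to_reduce_sum(
    self.model,
    # compare enum to enum; an Enum member never equals a bare string
    use_dolomite=(self.model_type == ModelTypes.DOLOMITE),
)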

)
self.model = add_noisy_embeddings(self.model, noise_alpha=self.noise_alpha)

@staticmethod
Contributor

remove these two functions from utils.py then?

"""Check if a GPU supports FlashAttention."""
major, minor = torch.cuda.get_device_capability(device_id)
# Check if the GPU architecture is Ampere (SM 8.x) or newer (SM 9.0)
is_sm8x = major == 8 and minor >= 0
Contributor

(No action required) Could be:

if ...:
    return True
if ...:
    return True
if ...:
    return True
return False
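Filled in against the capability check above, that early-return style might look like this (a sketch; the SM thresholds are taken from the comment in the original code):

import torch


def supports_flash_attention(device_id: int = 0) -> bool:
    """Check if a GPU supports FlashAttention."""
    major, _minor = torch.cuda.get_device_capability(device_id)
    if major == 8:  # Ampere (SM 8.x)
        return True
    if major >= 9:  # SM 9.0 or newer
        return True
    return False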

@@ -692,15 +503,35 @@ def main(args):
extra={"hparams": True},
)

-    model, lr_scheduler, optimizer, accelerator = setup_model(
-        args, tokenizer, train_loader, grad_accum, flash_enabled
+    accelerator = setup_accelerator(args, m, grad_accum)
Contributor

don't you mean to use the Accelerator class here? I think this belongs to the setup_accelerator module, which is replaced by the Accelerator class in this PR. (btw please also remove the old code.)

@cdoern
Contributor Author

cdoern commented May 30, 2025

@booxter thanks for the review. I actually meant to remove Accelerator in this PR, which is why there is a confusing non-usage of that class. I intend to introduce it in a 2/n PR, just for clarity.

In regard to most of the other comments: a lot of them are inherited from the existing code or are mis-steps I made when splitting out my mega PR (I forgot to take my changes from utils.py, for example). Will take another pass here. Thanks!
