Skip to content

test_training_feature_matrix[False-DistributedBackend.FSDP] fails with "RuntimeError: Suffered a failure during distributed training. Please see the training logs for more context." #505

Open
@booxter

Description

@booxter

Can be seen here: https://github.com/instructlab/training/actions/runs/14719818289/job/41311408922

Saving training state: {'current_epoch': 0, 'samples_seen': 6176}
Model state saved in: /tmp/tmprvv1ep3h/checkpoints/full_state/epoch_0
Epoch 0: 100%|█████████████████████████████████| 13/13 [10:07<00:00, 46.70s/it]
Training subprocess has not exited yet. Sending SIGTERM.
Waiting for process to exit, 60s...

It looks like training main entrypoint attempts to clean up after training completed and model saved. But torch process runs and doesn't exit on SIGTERM, so the test times out with failure.

Note: this test run is using tox-current-env (in attempt to fix issues with flash-attn installation missing torch), but I don't expect it to be necessarily related.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions