Open
Description
Can be seen here: https://github.com/instructlab/training/actions/runs/14719818289/job/41311408922
Saving training state: {'current_epoch': 0, 'samples_seen': 6176}
Model state saved in: /tmp/tmprvv1ep3h/checkpoints/full_state/epoch_0
Epoch 0: 100%|█████████████████████████████████| 13/13 [10:07<00:00, 46.70s/it]
Training subprocess has not exited yet. Sending SIGTERM.
Waiting for process to exit, 60s...
It looks like the training main entrypoint attempts to clean up after training has completed and the model has been saved. However, the torch process keeps running and doesn't exit on SIGTERM, so the test times out and fails.
Note: this test run uses tox-current-env (in an attempt to fix issues with the flash-attn installation missing torch), but I don't expect that to be related.
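
For illustration, the SIGTERM-then-wait behavior seen in the log could be extended with a SIGKILL fallback so that a subprocess that ignores SIGTERM cannot hang the test. This is only a minimal sketch and not the project's actual cleanup code: `terminate_with_escalation`, `CHILD_CODE`, and `demo` are hypothetical names, the child script is a stand-in for the stuck torch process, and POSIX signal semantics are assumed.

```python
import signal
import subprocess
import sys


def terminate_with_escalation(proc: subprocess.Popen, grace_seconds: float = 60.0) -> int:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Hypothetical helper; the 60s default mirrors the wait shown in the log.
    """
    if proc.poll() is not None:
        # Process already exited; nothing to clean up.
        return proc.returncode
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        return proc.wait()


# Hypothetical stand-in for the training subprocess: it ignores SIGTERM.
CHILD_CODE = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(30)\n"
)


def demo() -> int:
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        text=True,
    )
    # Block until the child has installed its SIGTERM handler.
    child.stdout.readline()
    # Short grace period for the demo only.
    returncode = terminate_with_escalation(child, grace_seconds=1.0)
    child.stdout.close()
    return returncode


if __name__ == "__main__":
    print("child exit code:", demo())
```

On POSIX, `Popen.returncode` is the negated signal number when the child was killed by a signal, so the result tells you whether the escalation path was taken. A forced kill only masks the hang, though. Why the torch process ignores SIGTERM still needs to be investigated.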