For training an SDXL LoRA (1024,1024) on an RTX 4060 with only 8 GB VRAM, I've used a few options that help at that memory budget. Now, with the V25.X GUI, I enabled "Cache text encoder outputs": no more shared-memory spillover, and a speedup from 15-25 s/step to an incredible 1.5 s/step. But resuming a previously trained project seems to start from the beginning, even though the logs report that all steps are loaded from the given directory. Any clues on this?
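For reference, a minimal sketch of what the equivalent direct sd-scripts invocation might look like (an assumption on my side: the GUI checkbox presumably maps to the `--cache_text_encoder_outputs` flag of `sdxl_train_network.py`, which requires training the U-Net only; all paths are placeholders):

```python
# Hedged sketch, not an exact training command; every path below is a placeholder.
import subprocess

cmd = [
    "accelerate", "launch", "sdxl_train_network.py",
    "--pretrained_model_name_or_path", "/path/to/sdxl_base.safetensors",
    "--train_data_dir", "/path/to/dataset",
    "--resolution", "1024,1024",
    "--network_module", "networks.lora",
    "--network_train_unet_only",       # caching TE outputs means the text encoders are not trained
    "--cache_text_encoder_outputs",    # the option that gave the big speedup on 8 GB VRAM
    "--output_dir", "/path/to/output",
]
subprocess.run(cmd, check=True)
```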
I've had a look at the kohya-ss sd-scripts repo; I had already noticed the issue described there myself. Maybe things got worse here in V25 because that issue still hasn't been resolved?
Dug deeper: resume training does work in V25.
Proof:
The resumed training (blue in the graphs) does not redo the LR warmup, and the average loss continues from where the previous run (orange) left off.
The dataset contained 68 images; max steps was initially set to 6800.
The resume config was identical to the initial one, except for the new max steps of 20400 (6800 × 3) and the added last-state directory.
Conclusion:
This is a bug in the kohya_ss scripts, as stated in the link above.
Resuming requires a max steps or max epochs value bigger than what was already reached in the "last state".
Training then stops normally at step 13600, i.e. when the new max steps minus the steps already trained in the "last state" (20400 − 6800) is reached, so only the difference is trained (see the sketch below).
The i…
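A small sketch of the arithmetic described above (my own illustration, not code from the sd-scripts repo; the function name is made up):

```python
def resumed_step_budget(max_train_steps: int, last_state_steps: int) -> int:
    """Steps a resumed run will actually train, per the behavior observed above."""
    if max_train_steps <= last_state_steps:
        # Matches the observation that resuming needs a max steps/epochs
        # value bigger than what the "last state" already reached.
        raise ValueError("max steps must exceed the steps stored in the last state")
    return max_train_steps - last_state_steps

# Numbers from the experiment: 68 images, initial max steps 6800,
# resumed with max steps 20400 (6800 * 3).
print(resumed_step_budget(20400, 6800))  # -> 13600, where training regularly stopped
```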