
There might be a bug when using DeepSpeed with gradient_accumulation_steps > 1 #3552

Closed
@hiawui

Description


In Accelerate's DeepSpeed wrapper (`DeepSpeedEngineWrapper`), `backward` is implemented as:

```python
def backward(self, loss, **kwargs):
    # runs backpropagation and handles mixed precision
    self.engine.backward(loss, **kwargs)

    # Deepspeed's `engine.step` performs the following operations:
    # - gradient accumulation check
    # - gradient clipping
    # - optimizer step
    # - zero grad
    # - checking overflow
    # - lr_scheduler step (only if engine.lr_scheduler is not None)
    self.engine.step()
    # and this plugin overrides the above calls with no-ops when Accelerate runs under
    # Deepspeed, but allows normal functionality for non-Deepspeed cases thus enabling a simple
    # training loop that works transparently under many training regimes.
```

When using DeepSpeed, accelerator.backward() calls DeepSpeed engine.backward() and engine.step().

When gradient_accumulation_steps > 1, the accumulation boundary that Accelerate tracks (its sync_gradients state) and the boundary that the DeepSpeed engine tracks internally might not agree.

https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L2362-L2378

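For concreteness, here is a minimal, self-contained sketch of how the two boundary decisions can diverge. It only simulates the counters: the Accelerate-style decision is assumed to force a sync on the last batch of the dataloader, and the DeepSpeed-style decision is assumed to depend only on an internal micro-step counter, as suggested by the linked engine.py code; the exact conditions in both libraries differ in detail.

```python
# Standalone simulation (not Accelerate/DeepSpeed code): compare the two
# notions of "gradient accumulation boundary" over one epoch.
accumulation_steps = 4
num_batches = 10  # deliberately not a multiple of accumulation_steps

micro_steps = 0  # DeepSpeed-style counter, bumped on every engine.step() call
for batch_idx in range(num_batches):
    is_last_batch = batch_idx == num_batches - 1

    # Accelerate-style sync_gradients: step count OR end of dataloader
    accelerate_boundary = (batch_idx + 1) % accumulation_steps == 0 or is_last_batch

    # DeepSpeed-style is_gradient_accumulation_boundary(): micro-step counter only
    deepspeed_boundary = (micro_steps + 1) % accumulation_steps == 0

    flag = "  <-- mismatch" if accelerate_boundary != deepspeed_boundary else ""
    print(f"batch {batch_idx}: accelerate={accelerate_boundary} deepspeed={deepspeed_boundary}{flag}")

    micro_steps += 1  # DeepSpeed increments this inside engine.step()
```

With 10 batches and accumulation_steps=4, the last batch is a boundary for the Accelerate-style decision (end of dataloader) but not for the micro-step counter.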

DeepSpeed's engine.step() has its own gradient_accumulation_boundary state, which is not synchronized with Accelerate's sync_gradients state.
If this issue does exist, the Trainer in transformers likely has the same problem, since it also goes through accelerator.backward():

https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L2511-L2513

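One possible mitigation, sketched below under assumptions rather than as a verified fix: explicitly push Accelerate's sync_gradients flag into the DeepSpeed engine via DeepSpeed's set_gradient_accumulation_boundary() override before calling accelerator.backward(). This assumes that, with the DeepSpeed plugin active, the model returned by accelerator.prepare() is the DeepSpeedEngine, and that the installed DeepSpeed version exposes that method (hence the hasattr guard).

```python
from accelerate import Accelerator


def train(accelerator: Accelerator, model, optimizer, dataloader):
    """Sketch of a loop that forwards Accelerate's sync_gradients to DeepSpeed.

    Assumes `accelerator.prepare()` was already called, so under the DeepSpeed
    plugin `model` is the DeepSpeedEngine.
    """
    for batch in dataloader:
        with accelerator.accumulate(model):
            loss = model(**batch).loss

            # Keep DeepSpeed's internal boundary flag in lock-step with Accelerate.
            # set_gradient_accumulation_boundary() is DeepSpeed's hook for overriding
            # its automatic (micro_steps-based) boundary detection.
            if hasattr(model, "set_gradient_accumulation_boundary"):
                model.set_gradient_accumulation_boundary(accelerator.sync_gradients)

            accelerator.backward(loss)  # under DeepSpeed: engine.backward() + engine.step()
            optimizer.step()            # no-op wrapper when DeepSpeed drives the step
            optimizer.zero_grad()       # likewise a no-op under DeepSpeed
```

Whether Accelerate itself should do this synchronization inside DeepSpeedEngineWrapper.backward() is essentially the question this issue raises.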
