[Question] how to do layer-wise caching of activations and gradients in case of low GPU memory when working with larger models?

I'm trying to run attribution patching on the Qwen 7B Instruct model, with logit difference as the metric, using 2 A6000 GPUs (48 GB each). My implementation is similar to the attribution patching notebook in the demos, but with a sequence length of about 300 tokens. To reduce memory usage, I tried caching activations and gradients one layer at a time: I set hooks for a single layer and run a separate forward and backward pass for each layer. However, this still leads to CUDA OOM errors. I would appreciate any help or suggestions.
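For reference, here is a minimal sketch of the per-layer approach described above. It uses plain PyTorch with a toy stack of residual blocks standing in for the model, so it is not the TransformerLens API and not my actual code. `Block`, `metric`, and `cache_layer` are hypothetical names made up for illustration. The sketch assumes parameter gradients are not needed, so it freezes the parameters and starts gradient tracking at the hooked layer. That way autograd should only keep saved tensors from that layer onward, and each cached tensor is moved to the CPU right away. Whether this is enough for a 7B model at 300 tokens is something I have not verified.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class Block(nn.Module):
    """Toy residual block standing in for one transformer layer (hypothetical)."""

    def __init__(self, d):
        super().__init__()
        self.lin = nn.Linear(d, d)

    def forward(self, x):
        return x + torch.tanh(self.lin(x))


d_model, n_layers = 16, 4
model = nn.Sequential(*[Block(d_model) for _ in range(n_layers)])
model.requires_grad_(False)  # assumption: no parameter gradients are needed


def metric(out):
    # Stand-in for logit difference: "logit" 0 minus "logit" 1, summed.
    return out[..., 0].sum() - out[..., 1].sum()


def cache_layer(model, x, layer_idx, with_grad):
    """Run one pass and return (activation, grad) for a single layer, both on CPU."""
    store = {}

    def fwd_hook(module, inputs, output):
        store["act"] = output.detach().cpu()
        if with_grad:
            # Cut the graph here: the layer output becomes a fresh leaf, so
            # autograd only records operations from this layer onward.
            leaf = output.detach().requires_grad_(True)
            leaf.register_hook(lambda g: store.__setitem__("grad", g.detach().cpu()))
            return leaf  # returning a value replaces the module output

    handle = model[layer_idx].register_forward_hook(fwd_hook)
    try:
        with torch.set_grad_enabled(with_grad):
            out = model(x)
            if with_grad:
                metric(out).backward()
    finally:
        handle.remove()  # always remove the hook, even if the pass fails
    return store["act"], store.get("grad")


clean = torch.randn(2, 5, d_model)      # (batch, seq, d_model)
corrupted = torch.randn(2, 5, d_model)

scores = []
for i in range(n_layers):
    clean_act, _ = cache_layer(model, clean, i, with_grad=False)
    corr_act, corr_grad = cache_layer(model, corrupted, i, with_grad=True)
    # Keep only the reduced attribution score, not the full tensors.
    scores.append(((clean_act - corr_act) * corr_grad).sum(-1))

attribution = torch.stack(scores)  # (n_layers, batch, seq)
```

Only the reduced per-position score is kept for each layer, so the full activations and gradients for all layers are never held at once. Note that this costs two passes per layer. Also, if the parameters still require gradients, the backward pass keeps saved tensors for the whole model regardless of which layer is hooked, which may be one reason the layer-wise version still runs out of memory.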

[Question] how to do layer-wise caching of activations and gradients in case of low GPU memory when working with larger models? #878
