Open
Description
I'm trying to run attribution patching with logit difference as the metric on Qwen 7B Instruct, using 2 A6000 GPUs (48 GB each). My implementation follows the attribution patching notebook in the demos, but with a sequence length of about 300 tokens. To reduce memory usage, I tried caching activations and gradients one layer at a time, registering hooks for each layer independently across multiple forward and backward calls. However, this still hits CUDA OOM errors. I'd appreciate any help or suggestions.
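For reference, here is the general shape of what I'm attempting, as a toy-model sketch (illustrative module and hook names, not my actual Qwen/TransformerLens code): cache the corrupted run's activations on CPU, then do a single clean forward+backward where a tensor gradient hook accumulates the attribution score grad × (corrupted − clean) per layer immediately, so full gradient tensors are never all held on the GPU at once.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer: a stack of linear "layers".
# In real code these hooks would go on TransformerLens hook points
# such as model.blocks[i].hook_resid_post (names here are illustrative).
class ToyModel(nn.Module):
    def __init__(self, d=8, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(d, d) for _ in range(n_layers))
        self.unembed = nn.Linear(d, 2)

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return self.unembed(x)

model = ToyModel()
clean = torch.randn(1, 8)
corrupted = torch.randn(1, 8)

def logit_diff(logits):
    # Placeholder metric: difference between two logits.
    return logits[0, 0] - logits[0, 1]

# Pass 1: cache corrupted-run activations, offloaded to CPU, no grads.
corrupted_acts = {}
hooks = []
def save_hook(name):
    def fn(module, inp, out):
        corrupted_acts[name] = out.detach().to("cpu")
    return fn
for i, layer in enumerate(model.layers):
    hooks.append(layer.register_forward_hook(save_hook(f"layer{i}")))
with torch.no_grad():
    model(corrupted)
for h in hooks:
    h.remove()

# Pass 2: clean run with grads. A tensor gradient hook reduces each
# layer's gradient to a scalar attribution on the spot, instead of
# caching every full gradient tensor.
attributions = {}
hooks = []
def attr_hook(name):
    def fwd(module, inp, out):
        clean_act = out.detach()
        def grad_fn(grad):
            delta = corrupted_acts[name].to(grad.device) - clean_act
            attributions[name] = (grad * delta).sum().item()
        out.register_hook(grad_fn)
    return fwd
for i, layer in enumerate(model.layers):
    hooks.append(layer.register_forward_hook(attr_hook(f"layer{i}")))
logits = model(clean)
logit_diff(logits).backward()
for h in hooks:
    h.remove()

print(attributions)  # one scalar attribution per layer
```

Note the backward pass still has to build the autograd graph for the whole model, so the activation memory of a single full backward is the floor; the savings here come only from not retaining per-layer gradient and activation caches across multiple passes.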