Code for the paper "Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition". [Paper]
conda create -n fi python=3.12
conda activate fi
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
pip install transformer_lens==2.16.0 transformers==4.51.3
pip install git+https://github.com/callummcdougall/CircuitsVis.git#subdirectory=python
pip install matplotlib plotly
pip install --upgrade nbformat
# to access data_utils.py and patching_utils.py in subdirectories
export PYTHONPATH="$PWD:$PYTHONPATH"
"Off-by-one addition" is the counterfactual task of adding an additional one to the sum of two numbers, e.g., completing the sequence "1+1=3, 2+2=5, 3+3=" with "7".
- We evaluated six recent language models; all of them achieve non-trivial to near-perfect performance on this task.
- See `1_off_by_one_addition_eval/eval.py` for our code to evaluate models on off-by-one addition (a simplified evaluation sketch is shown below); `results_aggregated.csv` contains all evaluation results.
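As an illustration of what such an evaluation can look like, here is a hedged sketch using Hugging Face transformers with greedy decoding; the model name, `max_new_tokens`, and prefix-match scoring are assumptions rather than the settings used in `eval.py`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-9b"  # assumption: stand-in for any of the evaluated models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def off_by_one_accuracy(examples: list[tuple[str, str]]) -> float:
    """Greedy-decode each prompt and check whether the completion starts with the target."""
    correct = 0
    for prompt, answer in examples:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=4, do_sample=False)
        new_tokens = output[0, inputs["input_ids"].shape[1]:]
        completion = tokenizer.decode(new_tokens, skip_special_tokens=True)
        correct += completion.strip().startswith(answer)
    return correct / len(examples)

# e.g., off_by_one_accuracy([("1+1=3, 2+2=5, 3+3=", "7")])
```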
- We use mechanistic interpretability techniques to understand how models manage to do this. We identify three groups of attention heads, connected in a way similar to the induction head mechanism.
- See `2_circuit_discovery/gemma_2_9b.ipynb` for a demo of circuit discovery and result visualization using Gemma-2 (9B); a simplified activation-patching sketch follows.
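The sketch below shows head-level activation patching with TransformerLens in the spirit of the notebook: run a clean off-by-one prompt and a regular-addition control, patch each head's output from the clean run into the control run, and measure how much the off-by-one answer is restored. The prompt pair, metric, and model choice are illustrative assumptions.

```python
import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gemma-2-9b")  # assumption: model used in the demo

# The two prompts are chosen so that they tokenize to the same length.
clean = "1+1=3, 2+2=5, 3+3="    # off-by-one prompt
corrupt = "1+1=2, 2+2=4, 3+3="  # regular-addition control

clean_tokens = model.to_tokens(clean)
corrupt_tokens = model.to_tokens(corrupt)
_, clean_cache = model.run_with_cache(clean_tokens)

off_by_one_answer = model.to_single_token("7")
plain_sum = model.to_single_token("6")

def logit_diff(logits):
    # How strongly the model prefers the off-by-one answer over the plain sum.
    return (logits[0, -1, off_by_one_answer] - logits[0, -1, plain_sum]).item()

def patch_head(z, hook, head):
    # Overwrite one head's output in the corrupted run with its clean-run value.
    z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
    return z

# Patch each head in turn and record how much it restores the off-by-one behaviour.
effects = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        logits = model.run_with_hooks(
            corrupt_tokens,
            fwd_hooks=[(get_act_name("z", layer),
                        lambda z, hook, head=head: patch_head(z, hook, head))],
        )
        effects[layer, head] = logit_diff(logits)
```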
- We study the effect of the FI (function induction) heads' outputs by patching them onto naive prompts (e.g., "2=2, 3=?").
- See `3_circuit_eval_and_analysis/function_vector.ipynb` for a demo of function vector style analysis; a minimal sketch of this kind of intervention follows.
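Here is a hedged sketch of a function-vector-style intervention of this kind with TransformerLens: collect the FI heads' residual-stream contributions at the last token of an off-by-one prompt, then add the summed vector to the residual stream of a naive prompt. The head indices, injection layer, and prompts are illustrative placeholders, not the heads identified in the paper.

```python
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gemma-2-9b")  # assumption

fi_heads = [(20, 3), (24, 7)]  # hypothetical (layer, head) indices of FI heads
inject_layer = max(layer for layer, _ in fi_heads)

# 1) Sum the FI heads' contributions to the residual stream at the final token
#    of an off-by-one prompt.
_, cache = model.run_with_cache(model.to_tokens("1+1=3, 2+2=5, 3+3="))
function_vector = sum(
    cache["z", layer][0, -1, head] @ model.W_O[layer, head]  # project head output into the residual stream
    for layer, head in fi_heads
)

# 2) Add that vector to the residual stream of a naive prompt and read off the prediction.
def add_function_vector(resid, hook):
    resid[:, -1, :] += function_vector
    return resid

logits = model.run_with_hooks(
    model.to_tokens("2=2, 3="),
    fwd_hooks=[(get_act_name("resid_post", inject_layer), add_function_vector)],
)
# If the +1 function transfers, the top prediction shifts from "3" toward "4".
print(model.to_string(logits[0, -1].argmax().item()))
```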
- We design four task pairs: (a) Off-by-k addition, (b) Shifted MMLU, (c) Caesar Cipher, (d) Base-k addition.
- When the identified function induction heads are ablated, models struggle to perform the second step of these two-step tasks.
- See `4_task_generalization` for code to reproduce the results; a minimal head-ablation sketch follows.
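Below is a minimal sketch of the kind of head ablation this refers to, using TransformerLens hooks; the head indices and the zero-ablation choice are illustrative assumptions (the paper's experiments may use a different ablation scheme).

```python
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gemma-2-9b")  # assumption
fi_heads = [(20, 3), (24, 7)]                            # hypothetical FI head indices

def make_ablation_hook(heads):
    def ablate(z, hook):
        # Zero out the output of the selected heads at every position.
        z[:, :, heads, :] = 0.0
        return z
    return ablate

def run_with_fi_heads_ablated(prompt):
    """Run the model with all FI heads zero-ablated and return the logits."""
    fwd_hooks = [
        (get_act_name("z", layer),
         make_ablation_hook([h for l, h in fi_heads if l == layer]))
        for layer in {l for l, _ in fi_heads}
    ]
    return model.run_with_hooks(model.to_tokens(prompt), fwd_hooks=fwd_hooks)

# Example: off-by-2 addition, where the expected completion is "8".
logits = run_with_fi_heads_ablated("1+1=4, 2+2=6, 3+3=")
```

Comparing predictions with and without these hooks on the two-step tasks above isolates the contribution of the FI heads to the second (function-application) step.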