AxBench is a scalable benchmark that evaluates interpretability techniques on two axes: concept detection and model steering.

Related papers

πŸ† Rank-1 steering leaderboard

📒 Please open a PR to enter the leaderboard.

| Method | 2B L10 | 2B L20 | 9B L20 | 9B L31 | Avg |
|---|---|---|---|---|---|
| Prompt | 0.698 | 0.731 | 1.075 | 1.072 | 0.894 |
| RePS [Wu et al., 2025] | 0.756 | 0.606 | 0.892 | 0.624 | 0.720 |
| ReFT-r1 | 0.633 | 0.509 | 0.630 | 0.401 | 0.543 |
| SAE (filtered) [Arad et al., 2025] | - | - | 0.546 | 0.470 | 0.508 |
| DiffMean | 0.297 | 0.178 | 0.322 | 0.158 | 0.239 |
| SAE | 0.177 | 0.151 | 0.191 | 0.140 | 0.165 |
| SAE-A | 0.166 | 0.132 | 0.186 | 0.143 | 0.157 |
| LAT | 0.117 | 0.130 | 0.127 | 0.134 | 0.127 |
| PCA | 0.107 | 0.083 | 0.128 | 0.104 | 0.105 |
| Probe | 0.095 | 0.091 | 0.108 | 0.099 | 0.098 |

Columns give the target model and layer (e.g., 2B L10 = Gemma-2-2B, layer 10); entries are steering scores (higher is better).

Highlights

  1. Scalable evaluation harness: a framework for generating synthetic training + eval data from concept lists (e.g., GemmaScope SAE labels).
  2. Comprehensive implementations: 10+ interpretability methods evaluated, along with finetuning and prompting baselines.
  3. 16K concept training data: Full-scale datasets for supervised dictionary learning (SDL).
  4. Two pretrained SDL models: Drop-in replacements for standard SAEs.
  5. LLM-in-the-loop training: Generate your own datasets for less than $0.01 per concept.

Additional experiments

We include exploratory notebooks under axbench/examples, such as:

| Experiment | Description |
|---|---|
| basics.ipynb | Analyzes basic geometry of learned dictionaries. |
| subspace_gazer.ipynb | Visualizes learned subspaces. |
| lang>subspace.ipynb | Fine-tunes a hyper-network to map natural language to subspaces or steering vectors. |
| platonic.ipynb | Explores the platonic representation hypothesis in subspace learning. |

Instructions for AxBenching your methods

Installation

We highly suggest using uv for your Python virtual environment, but you can use any venv manager.
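
If you don't already have uv, you can install it with its standalone installer (the command below is for Linux/macOS; see the uv documentation for other platforms):

curl -LsSf https://astral.sh/uv/install.sh | sh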

git clone [email protected]:stanfordnlp/axbench.git
cd axbench
uv sync # if using uv

Set up your API keys for OpenAI and Neuronpedia:

import os
os.environ["OPENAI_API_KEY"] = "your_openai_api_key_here"
os.environ["NP_API_KEY"] = "your_neuronpedia_api_key_here"

Download the necessary datasets to axbench/data:

uv run axbench/data/download-seed-sentences.py
cd axbench/data
bash download-2b.sh
bash download-9b.sh
bash download-alpaca.sh

Try a simple demo.

To run a complete demo with a single config file:

bash axbench/demo/demo.sh

Data generation

(If using our pre-generated data, you can skip this.)

Generate training data:

uv run axbench/scripts/generate.py --config axbench/demo/sweep/simple.yaml --mode training --dump_dir axbench/demo

Generate inference data:

uv run axbench/scripts/generate.py --config axbench/demo/sweep/simple.yaml --mode latent --dump_dir axbench/demo

Generate preference-based training data:

uv run axbench/scripts/generate.py --config axbench/demo/sweep/simple.yaml \
  --mode dpo_training --dump_dir axbench/demo \
  --model_name google/gemma-2-2b-it \
  --inference_batch_size 64

To modify the data generation process, edit simple.yaml.

Training

Train and save your methods:

uv run torchrun --nproc_per_node=$gpu_count axbench/scripts/train.py \
  --config axbench/demo/sweep/simple.yaml \
  --dump_dir axbench/demo

(Replace $gpu_count with the number of GPUs to use.)
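
On a machine with NVIDIA GPUs, you can set $gpu_count automatically (assuming nvidia-smi is available):

gpu_count=$(nvidia-smi -L | wc -l)  # count visible GPUs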

For additional config:

torchrun --nproc_per_node=$gpu_count axbench/scripts/train.py \
  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
  --overwrite_data_dir axbench/concept500/prod_2b_l10_v1/generate

where --dump_dir is the output directory and --overwrite_data_dir points to the directory containing the training data. You can also override other parameters, such as --layer 10, for customized tuning, as shown below.
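
For example, to override the target layer on top of the custom-config command above (a sketch; this assumes --layer is accepted alongside the other overrides):

torchrun --nproc_per_node=$gpu_count axbench/scripts/train.py \
  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
  --overwrite_data_dir axbench/concept500/prod_2b_l10_v1/generate \
  --layer 10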

Inference

Concept detection

Run inference:

uv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \
  --config axbench/demo/sweep/simple.yaml \
  --dump_dir axbench/demo \
  --mode latent

For additional config using custom directories:

uv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \
  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
  --overwrite_metadata_dir axbench/concept500/prod_2b_l10_v1/generate \
  --overwrite_inference_data_dir axbench/concept500/prod_2b_l10_v1/inference \
  --mode latent

Imbalanced concept detection

To mimic real-world scenarios where fewer than 1% of examples are positive, we upsample negatives to a 100:1 negative-to-positive ratio and re-evaluate. Use:

uv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \
  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
  --overwrite_metadata_dir axbench/concept500/prod_2b_l10_v1/generate \
  --overwrite_inference_data_dir axbench/concept500/prod_2b_l10_v1/inference \
  --mode latent_imbalance

Model steering

For steering experiments:

uv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \
  --config axbench/demo/sweep/simple.yaml \
  --dump_dir axbench/demo \
  --mode steering

Or a custom run:

uv run torchrun --nproc_per_node=$gpu_count axbench/scripts/inference.py \
  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
  --overwrite_metadata_dir axbench/concept500/prod_2b_l10_v1/generate \
  --overwrite_inference_data_dir axbench/concept500/prod_2b_l10_v1/inference \
  --mode steering

Evaluation

Concept detection

To evaluate concept detection results:

uv run axbench/scripts/evaluate.py \
  --config axbench/demo/sweep/simple.yaml \
  --dump_dir axbench/demo \
  --mode latent

Enable wandb logging:

uv run axbench/scripts/evaluate.py \
  --config axbench/demo/sweep/simple.yaml \
  --dump_dir axbench/demo \
  --mode latent \
  --report_to wandb \
  --wandb_entity "your_wandb_entity"

Or evaluate using your custom config:

uv run axbench/scripts/evaluate.py \
  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
  --mode latent

Model steering on evaluation set

To evaluate steering:

uv run axbench/scripts/evaluate.py \
  --config axbench/demo/sweep/simple.yaml \
  --dump_dir axbench/demo \
  --mode steering

Or a custom config:

uv run axbench/scripts/evaluate.py \
  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
  --mode steering

Model steering on test set

Note that the command above evaluates on the evaluation set, which we use to select the best steering factor. After selecting the factor, evaluate on the test set:

uv run axbench/scripts/evaluate.py \
  --config axbench/sweep/wuzhengx/2b/l10/no_grad.yaml \
  --dump_dir axbench/results/prod_2b_l10_concept500_no_grad \
  --mode steering_test

Analyses

Once you have finished evaluation, you can run the analyses with our provided notebook at axbench/scripts/analyses.ipynb. All of the results in the paper are produced by this notebook.

You need to point the relevant directories to your own results by modifying the notebook. If you introduce new models, datasets, or evaluation metrics, you can add your own analyses by following the notebook's existing structure.
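
For example, an early notebook cell typically needs to point at your own result directories; the variable name below is an illustrative assumption, not the notebook's actual identifier:

# hypothetical cell: map run names to your own --dump_dir outputs
result_dirs = {
    "2b_l10": "axbench/results/prod_2b_l10_concept500_no_grad",
}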

Reproducing our results

Please see axbench/experiment_commands.txt for detailed commands and configurations.

Feature suppression experiments

In our recent paper release, we introduce feature suppression evaluations. Please see axbench/sweep/wuzhengx/reps/README.md for details.