wassname/LoRA_are_lie_detectors

Adapters are end-to-end probes

Typically, large language model (LLM) probes train a linear classifier on the model's residual stream, or use a sparse autoencoder. An alternative is to use an adapter such as LoRA: instead of fitting a classifier on frozen hidden states, the adapter is trained end-to-end by backpropagating through the model. The key questions are: how does this work, and how well does it generalize?
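
Below is a minimal sketch of what such an end-to-end probe could look like, assuming a Hugging Face transformers + peft stack, Phi-2 as the base model, and a binary truthful/deceptive label per example; the `probe` head and `step` function are illustrative names, not taken from this repo.

```python
# Hedged sketch: a LoRA adapter trained end-to-end as a lie-detection probe.
# Assumptions (not from the repo): microsoft/phi-2 as the base model and a
# binary truthful/deceptive label per text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # so the final position is always a real token

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
lora = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                  target_modules=["q_proj", "k_proj", "v_proj"])
model = get_peft_model(base, lora)   # base weights frozen, LoRA weights trainable

# Small linear head read off the final token's last hidden state.
probe = torch.nn.Linear(base.config.hidden_size, 2)
params = [p for p in model.parameters() if p.requires_grad] + list(probe.parameters())
optim = torch.optim.AdamW(params, lr=1e-4)

def step(texts, labels):
    """One training step: gradients flow through the probe *and* the adapter."""
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    hidden = model(**batch, output_hidden_states=True).hidden_states[-1][:, -1]
    loss = torch.nn.functional.cross_entropy(probe(hidden), labels)
    loss.backward()
    optim.step()
    optim.zero_grad()
    return loss.item()
```

A conventional linear probe corresponds to the same setup without the LoRA wrapper, so that only `probe` is trained on frozen hidden states.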

Refer to the branches for details on my experiments.

Stylized Facts:

  • Implementing the adapter as an importance matrix for a sparse autoencoder (SAE) does not seem to help.
  • Using the adapter's activations as counterfactual residual streams does not significantly improve results (see the sketch after this list).
  • Sparse autoencoders and VQ-VAEs (tokenized autoencoders) do not noticeably improve the outcome here (although VQ-VAE for interpretability looks promising in its own right).
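
As a rough illustration of the "counterfactual residual streams" bullet, the sketch below contrasts hidden states with the adapter enabled and disabled, reusing the (assumed) `model`, `tokenizer`, and training setup from the earlier sketch; peft's `disable_adapter()` context manager runs the frozen base model.

```python
import torch

@torch.no_grad()
def counterfactual_features(texts):
    """Difference between last-token hidden states with the LoRA adapter on vs. off."""
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    with_adapter = model(**batch, output_hidden_states=True).hidden_states[-1][:, -1]
    with model.disable_adapter():   # peft context manager: LoRA weights turned off
        without_adapter = model(**batch, output_hidden_states=True).hidden_states[-1][:, -1]
    # The delta is the adapter's edit to the residual stream at the final token;
    # it can be fed to a probe instead of (or alongside) the raw hidden state.
    return with_adapter - without_adapter
```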

Future Work:

  • I've been running Phi-2 on datasets where it returns incorrect answers. To take this further, I think a more reliable and natural way to generate and measure deception is needed.

Related work:

About

Experiment to see if low-rank adapters can work as interventions for lie detection on LLMs.
