wassname/LoRA_are_lie_detectors

Adapters are end-to-end probes

Typically, large language model (LLM) probes train a linear classifier on the model's residual stream, or use a sparse autoencoder. An alternative is to use an adapter such as LoRA: instead of fitting a classifier on frozen hidden states, the adapter is trained end-to-end by backpropagating through the model. The key questions are: how does this work, and how well does it generalize?
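
Below is a minimal sketch of what such an end-to-end probe could look like, assuming a Hugging Face transformers + peft stack, Phi-2 as the base model, and a binary truthful/deceptive label per example; the `probe` head and `step` function are illustrative names, not taken from this repo.

```python
# Hedged sketch: a LoRA adapter trained end-to-end as a lie-detection probe.
# Assumptions (not from the repo): microsoft/phi-2 as the base model and a
# binary truthful/deceptive label per text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # so the final position is always a real token

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
lora = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                  target_modules=["q_proj", "k_proj", "v_proj"])
model = get_peft_model(base, lora)   # base weights frozen, LoRA weights trainable

# Small linear head read off the final token's last hidden state.
probe = torch.nn.Linear(base.config.hidden_size, 2)
params = [p for p in model.parameters() if p.requires_grad] + list(probe.parameters())
optim = torch.optim.AdamW(params, lr=1e-4)

def step(texts, labels):
    """One training step: gradients flow through the probe *and* the adapter."""
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    hidden = model(**batch, output_hidden_states=True).hidden_states[-1][:, -1]
    loss = torch.nn.functional.cross_entropy(probe(hidden), labels)
    loss.backward()
    optim.step()
    optim.zero_grad()
    return loss.item()
```

A conventional linear probe corresponds to the same setup without the LoRA wrapper, so that only `probe` is trained on frozen hidden states.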

Refer to the branches for details on my experiments.

Stylized Facts:

  • Implementing the adapter as an importance matrix for a sparse autoencoder (SAE) does not seem to help.
  • Using the adapter's activations as counterfactual residual streams does not significantly improve results (see the sketch after this list).
  • Sparse autoencoders and VQ-VAEs (tokenized autoencoders) do not noticeably improve the outcome here (although VQ-VAE for interpretability looks promising in its own right).
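
As a rough illustration of the "counterfactual residual streams" bullet, the sketch below contrasts hidden states with the adapter enabled and disabled, reusing the (assumed) `model`, `tokenizer`, and training setup from the earlier sketch; peft's `disable_adapter()` context manager runs the frozen base model.

```python
import torch

@torch.no_grad()
def counterfactual_features(texts):
    """Difference between last-token hidden states with the LoRA adapter on vs. off."""
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    with_adapter = model(**batch, output_hidden_states=True).hidden_states[-1][:, -1]
    with model.disable_adapter():   # peft context manager: LoRA weights turned off
        without_adapter = model(**batch, output_hidden_states=True).hidden_states[-1][:, -1]
    # The delta is the adapter's edit to the residual stream at the final token;
    # it can be fed to a probe instead of (or alongside) the raw hidden state.
    return with_adapter - without_adapter
```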

Future Work:

  • I've been running Phi-2 on datasets where it returns incorrect answers. To take this further, I think a more reliable and natural way to generate and measure deception is needed.

Related work:

About

Experiment to see if low-rank adapters can work as interventions for lie detection on LLMs.
