This work was done as the capstone project for BlueDot Impact’s AI Safety Fundamentals - Alignment course, June 2024.

Summary

First, I identify activation vectors related to honesty in an RLHF’d LLM (Llama-2-13b-chat). Next, I demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, I show that a similar effect can be achieved by fine-tuning the vectors directly into (or out of) the model, by use of a loss function based on the cosine similarity of residual stream activations to the vectors. Finally, I compare the results to fine-tuning based on honest or dishonest prompts, and to online steering. Overall, fine-tuning the vectors into the models using the cosine similarity loss had the strongest effect on shifting model output in the intended direction, and showed some resistance to subsequent steering, suggesting the potential utility of this approach as a safety measure.

Introduction

The concept of activation steering/representation engineering is simple, and it is remarkable that it works. First, one identifies an activation pattern in a model (generally in the residual stream input or output) corresponding to a high-level behavior like "sycophancy" or "honesty" by a simple expedient such as running pairs of inputs with and without the behavior through the model and taking the mean of the differences in the pairs' activations. Then one adds the resulting vector, scaled by +/- various coefficients, to the model's activations as it generates new output, and the model gives output that has more or less of the behavior, as one desires. This would seem quite interesting from the perspective of LLM interpretability, and potentially safety.

Beneath the apparent simplicity of activation steering, there are a lot of details and challenges, from deciding on which behavioral dimension to use, to identifying the best way to elicit representations relevant to it in the model, to determining which layers to target for steering, and more. A number of differing approaches having been reported and many more are possible, and I explored many of them before settling on one to pursue more deeply; see this github repo for a longer discussion of this process and associated code.

In this work I extend the activation steering concept by permanently changing the weights of the model via fine-tuning, obviating the need for active steering with every input. Other researchers have independently explored the idea of fine-tuning as a replacement for online steering, but this work is distinctive in targeting the tuning specifically at model parameter activations, rather than the standard method of tuning based on model output deviations from target output. In addition to offering compute savings due to not having to add vectors to every token at inference, it was hypothesized that this approach might make the model more robust in its intended behavior. See this github repo for representation tuning code and methods. Tuned models are available in this HuggingFace repo.

The basic approach I use in the work is as follows:

  1. Identify candidate steering vectors for the behavioral dimension of interest (here, Honesty) via contrastive factual true/false prompts and PCA.
  2. Use visualizations to infer the meaning of the vectors and candidate model layers to target for steering/tuning.
  3. Identify the most effective steering parameters (layers and multipliers) via steering on an evaluation dataset containing contrastive prompts (but no labels).
  4. Fine tune the vectors into or out of the model, targeting the layers identified above, using cosine similarity loss and, separately, fine tune them in using target token loss.
  5. Test the impact of online steering, vector similarity loss tuning, and target token loss tuning on a third dataset of contrasting prompts for quantitative evaluation, and on a small set of more natural, moral questions, designed to offer a qualitative sense of the model’s behavior in more realistic situations.

Results