Travis J. Wheeler

CL
3papers
6citations
Novelty48%
AI Score41

3 Papers

94.2CLJun 2Code
Expert-Aware Refusal Steering

Anna C. Marbut, Daniel R. Olson, Travis J. Wheeler

Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense LLM during inference to effectively suppress refusal behavior, inducing response to harmful requests. We extend this refusal steering method to three open-source Mixture-of-Experts (MoE) LLMs and find that steering performance is uninhibited by the complex routing patterns inherent to the MoE architecture. We then propose two expert-aware refusal steering methods that leverage refusal-specific expert routing patterns and expert-specific steering directions to suppress normal refusal behavior. We find that refusal behavior can be effectively steered based on the output of a single expert. Our results show that refusal signals captured by steering methods differ from expert routing behavior, suggesting a substantial role for attention in MoE refusal behavior.

LGDec 15, 2022
Reliable Measures of Spread in High Dimensional Latent Spaces

Anna C. Marbut, Katy McKinney-Bock, Travis J. Wheeler

Understanding geometric properties of natural language processing models' latent spaces allows the manipulation of these properties for improved performance on downstream tasks. One such property is the amount of data spread in a model's latent space, or how fully the available latent space is being used. In this work, we define data spread and demonstrate that the commonly used measures of data spread, Average Cosine Similarity and a partition function min/max ratio I(V), do not provide reliable metrics to compare the use of latent space across models. We propose and examine eight alternative measures of data spread, all but one of which improve over these current metrics when applied to seven synthetic data distributions. Of our proposed measures, we recommend one principal component-based measure and one entropy-based measure that provide reliable, relative measures of spread and can be used to compare models of different sizes and dimensionalities.

CLJun 18, 2024
Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance

Anna C. Marbut, John W. Chandler, Travis J. Wheeler

It is generally thought that transformer-based large language models benefit from pre-training by learning generic linguistic knowledge that can be focused on a specific task during fine-tuning. However, we propose that much of the benefit from pre-training may be captured by geometric characteristics of the latent space representations, divorced from any specific linguistic knowledge. In this work we explore the relationship between GLUE benchmarking task performance and a variety of measures applied to the latent space resulting from BERT-type contextual language models. We find that there is a strong linear relationship between a measure of quantized cell density and average GLUE performance and that these measures may be predictive of otherwise surprising GLUE performance for several non-standard BERT-type models from the literature. These results may be suggestive of a strategy for decreasing pre-training requirements, wherein model initialization can be informed by the geometric characteristics of the model's latent space.