Xavier Fernandes

CL
h-index27
3papers
176citations
Novelty65%
AI Score33

3 Papers

CRJan 6, 2023
TrojanPuzzle: Covertly Poisoning Code-Suggestion Models

Hojjat Aghakhani, Wei Dai, Andre Manoel et al. · microsoft-research, mit

With tools like GitHub Copilot, automatic code suggestion is no longer a dream in software engineering. These tools, based on large language models, are typically trained on massive corpora of code mined from unvetted public sources. As a result, these models are susceptible to data poisoning attacks where an adversary manipulates the model's training by injecting malicious data. Poisoning attacks could be designed to influence the model's suggestions at run time for chosen contexts, such as inducing the model into suggesting insecure code payloads. To achieve this, prior attacks explicitly inject the insecure code payload into the training data, making the poison data detectable by static analysis tools that can remove such malicious data from the training set. In this work, we demonstrate two novel attacks, COVERT and TROJANPUZZLE, that can bypass static analysis by planting malicious poison data in out-of-context regions such as docstrings. Our most novel attack, TROJANPUZZLE, goes one step further in generating less suspicious poison data by never explicitly including certain (suspicious) parts of the payload in the poison data, while still inducing a model that suggests the entire payload when completing code (i.e., outside docstrings). This makes TROJANPUZZLE robust against signature-based dataset-cleansing methods that can filter out suspicious sequences from the training data. Our evaluation against models of two sizes demonstrates that both COVERT and TROJANPUZZLE have significant implications for practitioners when selecting code used to train or tune code-suggestion models.

LGNov 18, 2024
Steering Language Model Refusal with Sparse Autoencoders

Kyle O'Brien, David Majercak, Xavier Fernandes et al.

Responsible deployment of language models requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we explore an alternative: steering model activations at inference time via amplifying sparse autoencoder (SAE) features that mediate refusal. This work uncovers a fundamental tension between SAE steering-based safety improvements and general model capabilities. While feature steering successfully improves robustness against both single-turn and challenging multi-turn jailbreak attempts, we discover that this comes at a previously underexplored cost -- systematic degradation of performance across multiple benchmark tasks, even on safe inputs with no apparent connection to refusal behavior. This suggests that features mediating refusal may be more deeply entangled with general language model capabilities than previously understood. Our findings reveal important open questions about the nature of safety-relevant features in language models and the feasibility of isolating them for targeted intervention. While SAE-based steering shows promise as a flexible approach to enhancing language model safety, our results highlight the critical need to understand and address the mechanisms behind these capability tradeoffs before such techniques can be practically deployed.

CLNov 6, 2024
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Harsha Nori, Naoto Usuyama, Nicholas King et al. · microsoft-research

Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.