CL Aug 20, 2023
Steering Language Models With Activation Engineering
Alexander Matt Turner, Lisa Thiergart, Gavin Leech et al.
Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
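As a rough illustration of the idea in this abstract, the following is a minimal sketch of activation addition on a toy residual network. Everything here is an assumption made for illustration: the toy model, the token ids standing in for "Love" and "Hate", the layer index, and the coefficient are hypothetical, and the toy has no attention, so only the steered position changes. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_LAYERS, VOCAB = 16, 4, 50

# Hypothetical toy "model": an embedding table plus a stack of residual layers.
embed = rng.normal(size=(VOCAB, D_MODEL))
weights = [rng.normal(scale=0.3, size=(D_MODEL, D_MODEL)) for _ in range(N_LAYERS)]


def forward(tokens, steer_layer=None, steer_vec=None, coeff=0.0, return_layer=None):
    """Run the toy stack.

    If return_layer is set, return the residual stream entering that layer.
    If steer_vec is set, add coeff * steer_vec to the first positions of the
    residual stream entering steer_layer.
    """
    h = embed[np.asarray(tokens)]  # shape (seq_len, d_model)
    for layer, W in enumerate(weights):
        if layer == return_layer:
            return h
        if steer_vec is not None and layer == steer_layer:
            h = h.copy()
            k = steer_vec.shape[0]  # number of positions the vector covers
            h[:k] += coeff * steer_vec
        h = h + np.tanh(h @ W)  # residual update
    return h


def steering_vector(pos_tokens, neg_tokens, layer):
    """Difference of intermediate activations on a contrasting prompt pair."""
    a_pos = forward(pos_tokens, return_layer=layer)
    a_neg = forward(neg_tokens, return_layer=layer)
    return a_pos - a_neg  # prompts must have equal length


LOVE, HATE = [3], [7]        # hypothetical token ids for the prompt pair
prompt = [1, 5, 9, 2]        # hypothetical user prompt
vec = steering_vector(LOVE, HATE, layer=2)

base = forward(prompt)
steered = forward(prompt, steer_layer=2, steer_vec=vec, coeff=4.0)
unchanged = forward(prompt, steer_layer=2, steer_vec=vec, coeff=0.0)
```

No optimization step appears anywhere in the sketch: the steering vector comes from two forward passes on a single prompt pair, which is the property the abstract describes as lightweight.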
ML Feb 8, 2023
Decision trees compensate for model misspecification
Hugh Panton, Gavin Leech, Laurence Aitchison
The best-performing models in ML are not interpretable. If we can explain why they outperform, we may be able to replicate these mechanisms and obtain both interpretability and performance. One example is decision trees and their descendants, gradient boosting machines (GBMs). These perform well in the presence of complex interactions, with tree depth governing the order of interactions. However, interactions cannot fully account for the depth of trees found in practice. We confirm 5 alternative hypotheses about the role of tree depth in performance in the absence of true interactions, and present results from experiments on a battery of datasets. Part of the success of tree models is due to their robustness to various forms of misspecification. We present two methods for robust generalized linear models (GLMs) addressing the composite and mixed response scenarios.
LG Jul 17, 2024
Questionable practices in machine learning
Gavin Leech, Juan J. Vazquez, Niclas Kupper et al.
Evaluating modern ML models is hard. The strong incentive for researchers and companies to report a state-of-the-art result on some metric often leads to questionable research practices (QRPs): bad practices which fall short of outright research fraud. We describe 44 such practices which can undermine reported results, giving examples where possible. Our list emphasises the evaluation of large language models (LLMs) on public benchmarks. We also discuss "irreproducible research practices", i.e. decisions that make it difficult or impossible for other researchers to reproduce, build on or audit previous research.
LG Feb 12
Soft Contamination Means Benchmarks Test Shallow Generalization
Ari Spiesberger, Juan J. Vazquez, Nicky Pochinkov et al.
If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalization. Typical decontamination filters use n-gram matching, which fails to detect semantic duplicates: sentences with equivalent (or near-equivalent) content that are not close in string space. We study this soft contamination of training data by semantic duplicates. Among other experiments, we embed the Olmo3 training corpus and find that: 1) contamination remains widespread, e.g. we find semantic duplicates for 78% of CodeForces and exact duplicates for 50% of ZebraLogic problems; 2) including semantic duplicates of benchmark data in training does improve benchmark performance; and 3) when finetuning on duplicates of benchmark datapoints, performance also improves on truly-held-out datapoints from the same benchmark. We argue that recent benchmark gains are thus confounded: the prevalence of soft contamination means gains reflect both genuine capability improvements and the accumulation of test data and effective test data in growing training corpora.
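To make the failure mode of n-gram filters concrete, here is a minimal sketch of a word-level n-gram overlap check. The function names, the choice of 3-grams, and the example sentences are assumptions for illustration, not the filters or data used in the paper. It shows only the string-matching side; detecting the paraphrase would need an embedding model, which is not sketched here.

```python
def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(benchmark_item, training_text, n=3):
    """Fraction of the benchmark item's n-grams that also occur in training_text."""
    bench = word_ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & word_ngrams(training_text, n)) / len(bench)


# Hypothetical benchmark item, an exact copy, and a semantic duplicate.
benchmark = "Find the sum of all even numbers between one and one hundred"
exact_copy = "Find the sum of all even numbers between one and one hundred"
paraphrase = "Add up every even integer from 1 to 100"

exact_score = ngram_overlap(benchmark, exact_copy)
paraphrase_score = ngram_overlap(benchmark, paraphrase)
```

In this example the exact copy shares every 3-gram with the benchmark item, while the paraphrase shares none, so a threshold on this score would flag the first and pass the second even though both pose the same problem.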
AI Feb 6, 2024
Ten Hard Problems in Artificial Intelligence We Must Get Right
Gavin Leech, Simson Garfinkel, Misha Yagudin et al.
We explore the AI2050 "hard problems" that block the promise of AI and cause AI risks: (1) developing general capabilities of the systems; (2) assuring the performance of AI systems and their training processes; (3) aligning system goals with human goals; (4) enabling great applications of AI in real life; (5) addressing economic disruptions; (6) ensuring the participation of all; (7) at the same time ensuring socially responsible deployment; (8) addressing any geopolitical disruptions that AI causes; (9) promoting sound governance of the technology; and (10) managing the philosophical disruptions for humans living in the age of AI. For each problem, we outline the area, identify significant recent work, and suggest ways forward. [Note: this paper reviews literature through January 2023.]
LG May 18, 2023
Massively Parallel Reweighted Wake-Sleep
Thomas Heap, Gavin Leech, Laurence Aitchison
Reweighted wake-sleep (RWS) is a machine learning method for performing Bayesian inference in a very general class of models. RWS draws $K$ samples from an underlying approximate posterior, then uses importance weighting to provide a better estimate of the true posterior. RWS then updates its approximate posterior towards the importance-weighted estimate of the true posterior. However, recent work [Chatterjee and Diaconis, 2018] indicates that the number of samples required for effective importance weighting is exponential in the number of latent variables. Attaining such a large number of importance samples is intractable in all but the smallest models. Here, we develop massively parallel RWS, which circumvents this issue by drawing $K$ samples of all $n$ latent variables, and individually reasoning about all $K^n$ possible combinations of samples. While reasoning about $K^n$ combinations might seem intractable, the required computations can be performed in polynomial time by exploiting conditional independencies in the generative model. We show considerable improvements over standard "global" RWS, which draws $K$ samples from the full joint.
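The claim that a sum over all $K^n$ sample combinations can be computed in polynomial time can be illustrated with a small sketch. It assumes, purely for illustration, a chain-structured model in which the weight of a combination factorizes into a term for the first latent and pairwise terms for neighbouring latents; the factor values are random placeholders and the function names are hypothetical. This shows only the counting argument, not the paper's algorithm.

```python
import itertools
import numpy as np


def brute_force_sum(f1, pairwise):
    """Sum the factorized weight over all K**n index combinations explicitly."""
    K = f1.shape[0]
    n = len(pairwise) + 1
    total = 0.0
    for idx in itertools.product(range(K), repeat=n):
        term = f1[idx[0]]
        for i, F in enumerate(pairwise):
            term *= F[idx[i], idx[i + 1]]
        total += term
    return total


def chain_sum(f1, pairwise):
    """Same sum via a left-to-right pass of vector-matrix products, O(n * K**2)."""
    msg = f1.copy()
    for F in pairwise:
        msg = msg @ F  # sums out the previous latent's K samples
    return msg.sum()


rng = np.random.default_rng(0)
K, n = 3, 5  # 3 samples for each of 5 latents: 3**5 = 243 combinations

# Placeholder positive factors standing in for per-sample importance weights.
f1 = rng.uniform(0.1, 1.0, size=K)
pairwise = [rng.uniform(0.1, 1.0, size=(K, K)) for _ in range(n - 1)]

slow = brute_force_sum(f1, pairwise)
fast = chain_sum(f1, pairwise)
```

Both functions return the same value, but the explicit loop visits $K^n$ terms while the chain pass uses $n - 1$ vector-matrix products, which is the kind of saving that conditional independence makes possible.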
LG Sep 24, 2020
Legally grounded fairness objectives
Dylan Holden-Sim, Gavin Leech, Laurence Aitchison
Recent work has identified a number of formally incompatible operational measures for the unfairness of a machine learning (ML) system. As these measures all capture intuitively desirable aspects of a fair system, choosing "the one true" measure is not possible, and instead a reasonable approach is to minimize a weighted combination of measures. However, this simply raises the question of how to choose the weights. Here, we formulate Legally Grounded Fairness Objectives (LGFO), which uses signals from the legal system to non-arbitrarily measure the social cost of a specific degree of unfairness. The LGFO is the expected damages under a putative lawsuit that might be awarded to those who were wrongly classified, in the sense that the ML system made a decision different to that which would have been made under the court's preferred measure. Notably, the two quantities necessary to compute the LGFO, the court's preferences about fairness measures, and the expected damages, are unknown but well-defined, and can be estimated by legal advice. Further, as the damages awarded by the legal system are designed to measure and compensate for the harm caused to an individual by an unfair classification, the LGFO aligns closely with society's estimate of the social cost.
AP Jul 27, 2020
How Robust are the Estimated Effects of Nonpharmaceutical Interventions against COVID-19?
Mrinank Sharma, Sören Mindermann, Jan Markus Brauner et al.
To what extent are effectiveness estimates of nonpharmaceutical interventions (NPIs) against COVID-19 influenced by the assumptions our models make? To answer this question, we investigate 2 state-of-the-art NPI effectiveness models and propose 6 variants that make different structural assumptions. In particular, we investigate how well NPI effectiveness estimates generalise to unseen countries, and their sensitivity to unobserved factors. Models that account for noise in disease transmission compare favourably. We further evaluate how robust estimates are to different choices of epidemiological parameters and data. Focusing on models that assume transmission noise, we find that previously published results are remarkably robust across these variables. Finally, we mathematically ground the interpretation of NPI effectiveness estimates when certain common assumptions do not hold.