Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against
The paper addresses how language models learn from warning-framed training data. It finds that such warnings fail to teach avoidance: models reproduce flagged content at rates similar to direct exposure (76.7% vs. 83.3%).
This result reveals a fundamental limitation in how current models interpret training data. The contribution is incremental in that it builds on existing understanding of model behavior.
Warning-framed content in training data (e.g., "DO NOT USE - this code is vulnerable") does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: "describing X" and "performing X" activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, what I call "stealth slip", allows conversational preambles to rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that statistical co-occurrence dominates over pragmatic interpretation in current architectures. Models learn what tends to follow a context, not why it appeared there.
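To make the feature-overlap claim and the ablation idea concrete, here is a minimal sketch on synthetic data. It is not the procedure used in the experiments. Everything in it is an assumption made for illustration: the toy ReLU encoder, the dimensions, the index `FEATURE_ID` (a hypothetical stand-in, not feature #8684), and the randomly generated "warning" and "direct" activations. It shows only two things: how one might compare a single feature's mean activation across two contexts, and how projecting out that feature's direction drives its activation to zero.

```python
import numpy as np

def sae_encode(acts, W_enc, b_enc):
    """Toy sparse-autoencoder encoder: relu(acts @ W_enc + b_enc)."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def ablate_direction(acts, direction):
    """Remove the component of each activation vector along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)

rng = np.random.default_rng(0)
d_model, n_features, n_samples = 16, 32, 200
FEATURE_ID = 5  # hypothetical index, chosen arbitrarily for this sketch

# Random encoder weights stand in for a trained autoencoder (assumption).
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
direction = W_enc[:, FEATURE_ID]
unit = direction / np.linalg.norm(direction)

# Synthetic activations: both contexts share a strong component along the
# feature direction, mimicking the overlap described in the text.
warning_acts = 0.1 * rng.normal(size=(n_samples, d_model)) + 3.0 * unit
direct_acts = 0.1 * rng.normal(size=(n_samples, d_model)) + 3.0 * unit

# Mean activation of the chosen feature in each context.
warning_mean = sae_encode(warning_acts, W_enc, b_enc)[:, FEATURE_ID].mean()
direct_mean = sae_encode(direct_acts, W_enc, b_enc)[:, FEATURE_ID].mean()

# Ablate the feature direction, then re-measure the same feature.
ablated_acts = ablate_direction(warning_acts, direction)
ablated_mean = sae_encode(ablated_acts, W_enc, b_enc)[:, FEATURE_ID].mean()

print(f"warning context mean: {warning_mean:.3f}")
print(f"direct context mean:  {direct_mean:.3f}")
print(f"after ablation:       {ablated_mean:.3e}")
```

The projection here is applied to stored activations purely for demonstration. A training-time variant would presumably apply the same operation inside the forward pass while the model is being updated; that wiring is not specified in the text above and is not shown.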