LG · Feb 5

Chunky Post-Training: Data Driven Failures of Generalization

arXiv:2602.05910v1 · 1 citation
AI Analysis

The paper addresses unreliable generalization in LLMs, a concern for both developers and users. Its contribution is incremental, since it builds on existing concerns about data biases.

The paper tackles the problem of LLMs learning spurious correlations from post-training data, which leads to unintended behaviors such as rejecting true facts because of formatting quirks. It introduces SURF and TURF, tools that surface and trace these failures, and applies them to models such as Claude 4.5 and GPT-5.1.

LLM post-training involves many diverse datasets, each targeting a specific behavior. But these datasets encode incidental patterns alongside intended ones: correlations between formatting and content, narrow phrasings across diverse problems, and implicit associations arising from the discrete data curation process. These patterns are often invisible to developers yet salient to models, producing behaviors that surprise their creators, such as rejecting true facts presented in a particular question format. We call this chunky post-training: the model learns spurious correlations as a result of distinct chunks of post-training data. We introduce SURF, a black-box pipeline which surfaces these unintended behaviors at run time, and TURF, a tool that traces these failures back to specific post-training data. Applying these tools to frontier models (Claude 4.5, GPT-5.1, Grok 4.1, Gemini 3) and open models (Tülu 3), we show that chunky post-training produces miscalibrated behaviors, which often result from imbalanced or underspecified chunks of post-training data.
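
To make the idea of a "correlation between formatting and content" concrete, here is a minimal toy sketch. It is not SURF or TURF, and it does not reproduce anything from the paper. The data, the chunk names, the `has_format` feature, and the `label_rate_by_format` helper are all hypothetical, chosen only to show how a surface template can end up predicting the target label when one chunk of data uses that template exclusively.

```python
from collections import Counter

# Hypothetical toy data: (chunk name, prompt, target label).
# Nothing here comes from the paper; it only illustrates the idea that
# one chunk can tie a surface format to a single kind of answer.
EXAMPLES = [
    ("factcheck", "True or false: The sun is a planet.", "reject"),
    ("factcheck", "True or false: Fish can fly to the moon.", "reject"),
    ("factcheck", "True or false: Ice is hotter than steam.", "reject"),
    ("chat", "What is the capital of France?", "accept"),
    ("chat", "Is water made of hydrogen and oxygen?", "accept"),
    ("chat", "Does the Earth orbit the Sun?", "accept"),
]


def has_format(prompt: str) -> bool:
    """Surface feature (assumed): the prompt uses a 'True or false:' template."""
    return prompt.lower().startswith("true or false:")


def label_rate_by_format(examples, label="reject"):
    """Return the fraction of examples with `label`, split by format presence."""
    counts = Counter()
    hits = Counter()
    for _chunk, prompt, target in examples:
        key = has_format(prompt)
        counts[key] += 1
        hits[key] += int(target == label)
    return {key: hits[key] / counts[key] for key in counts}


rates = label_rate_by_format(EXAMPLES)
# A large gap means the template alone predicts the label in this data mix.
gap = rates.get(True, 0.0) - rates.get(False, 0.0)
print(f"reject rate with template:    {rates.get(True, 0.0):.2f}")
print(f"reject rate without template: {rates.get(False, 0.0):.2f}")
print(f"gap: {gap:.2f}")
```

In this toy mix every templated prompt carries the same label, so a model trained on it could plausibly learn "this template means reject" and then reject a true statement phrased the same way. That is the kind of miscalibration the abstract describes.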
