The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse
This work addresses a fundamental limitation in language models for AI researchers, revealing that apparent fixes may not lead to true generalization, making it incremental in understanding model behavior.
The paper investigates the reversal curse in autoregressive language models, showing that objectives with bidirectional supervision improve reversal accuracy but do not create a unified latent representation of facts, instead storing forward and reverse directions as distinct entries.
The reversal curse describes a failure of autoregressive language models to retrieve a fact in reverse order (e.g., training on ``$A > B$'' but failing on ``$B < A$''). Recent work shows that objectives with bidirectional supervision (e.g., bidirectional attention or masking-based reconstruction for decoder-only models) can mitigate the reversal curse. We extend this evaluation to include a vanilla masked language modeling (MLM) objective and compare it to decoder-only masking-based training across four reversal benchmarks and then provide a minimal mechanistic study of \emph{how} these objectives succeed. We show that reversal accuracy requires training signal that explicitly makes the source entity a prediction target, and we find little evidence that success corresponds to a single direction-agnostic representation of a fact. Instead, representation distances and linear probes are consistent with storing forward and reverse directions as distinct entries, with different indexing geometry for MLM versus decoder-only masking-based training. Our results caution that objective-level ``fixes'' can improve reversal behavior without necessarily inducing the kind of latent generalization one might expect from a unified concept.