LG AI | Sep 9, 2025

ACE and Diverse Generalization via Selective Disagreement

Oliver Daniels, Stuart Armstrong, Alexandre Maranhão, Mahirah Fairuz Rahman, Benjamin M. Marlin, Rebecca Gorman

arXiv:2509.07955v14.1 | h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses spurious correlations in machine learning, a critical obstacle to model robustness and generalization, particularly in domains such as language-model alignment. The advance is incremental, since it builds on prior methods for handling spurious correlations.

The paper tackles the sensitivity of deep neural networks to spurious correlations, especially complete correlations, which leave the correct generalization underspecified. It proposes ACE, a method that learns a set of concepts through self-training that encourages confident and selective disagreement. ACE matches or outperforms existing methods on complete-spurious-correlation benchmarks and achieves competitive performance on a measurement tampering detection benchmark for language-model alignment, without access to untrusted measurements.

Deep neural networks are notoriously sensitive to spurious correlations, where a model learns a shortcut that fails out-of-distribution. Existing work on spurious correlations has often focused on incomplete correlations, leveraging access to labeled instances that break the correlation. But in cases where the spurious correlations are complete, the correct generalization is fundamentally underspecified. To resolve this underspecification, we propose learning a set of concepts that are consistent with training data but make distinct predictions on a subset of novel unlabeled inputs. Using a self-training approach that encourages confident and selective disagreement, our method ACE matches or outperforms existing methods on a suite of complete-spurious correlation benchmarks, while remaining robust to incomplete spurious correlations. ACE is also more configurable than prior approaches, allowing for straightforward encoding of prior knowledge and principled unsupervised model selection. In an early application to language-model alignment, we find that ACE achieves competitive performance on the measurement tampering detection benchmark without access to untrusted measurements. While still subject to important limitations, ACE represents significant progress towards overcoming underspecification.
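The idea of concepts that fit the training data yet disagree confidently on only a subset of unlabeled inputs can be illustrated with a toy two-head objective. This is a minimal sketch under stated assumptions, not the loss from the paper: the function names, the `mix_rate` parameter (an assumed fraction of unlabeled inputs on which the heads should disagree), and the top-k pseudo-labeling rule are all hypothetical choices made for illustration.

```python
import numpy as np


def binary_cross_entropy(p, y, eps=1e-7):
    """Per-example binary cross-entropy between probabilities p and targets y."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


def selective_disagreement_loss(p1_lab, p2_lab, y_lab, p1_unl, p2_unl,
                                mix_rate=0.5, weight=1.0):
    """Toy objective for two binary heads (hypothetical, not the paper's loss).

    p1_*, p2_*: predicted probabilities of the two heads.
    mix_rate:   assumed fraction of unlabeled inputs where the heads disagree.
    """
    y_lab = np.asarray(y_lab, dtype=float)

    # Consistency: both heads must fit the labeled training data.
    fit = (binary_cross_entropy(p1_lab, y_lab).mean()
           + binary_cross_entropy(p2_lab, y_lab).mean())

    # Confidence: cost of pushing each unlabeled input toward a confident
    # disagreement, in whichever direction (1,0) or (0,1) is cheaper.
    cost_10 = binary_cross_entropy(p1_unl, 1.0) + binary_cross_entropy(p2_unl, 0.0)
    cost_01 = binary_cross_entropy(p1_unl, 0.0) + binary_cross_entropy(p2_unl, 1.0)
    cost = np.minimum(cost_10, cost_01)

    # Selectivity: only the k cheapest inputs are pseudo-labeled as
    # disagreements; the remaining inputs are left unconstrained.
    k = int(round(mix_rate * len(cost)))
    disagree = np.sort(cost)[:k].mean() if k > 0 else 0.0

    return fit + weight * disagree
```

In a self-training loop, the selected inputs would be re-chosen at each step from the current predictions, so the heads gradually commit to a confident split on part of the unlabeled data while the labeled-data term keeps both consistent with training.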
