LG | Jun 22, 2025

Data Curation Matters: Model Collapse and Spurious Shift Performance Prediction from Training on Uncurated Text Embeddings

Lucas Mattioli, Youness Ait Hadichou, Sabrina Chaouche, Martin Gonzalez

arXiv:2506.17989v1 | 4.1 | h-index: 2

Originality: Incremental advance

AI Analysis

This paper addresses a critical issue for machine learning practitioners who use text embeddings, highlighting risks in data curation and evaluation, though its contribution is an incremental refinement of existing embedding methods.

The study examines model collapse, a failure mode in which models trained on uncurated text embeddings converge to predicting a single class. It finds that collapse occurs consistently in this setting, that text embeddings do not effectively serve as a curation layer, and that collapse can produce spurious performance correlations.

Training models on uncurated Text Embeddings (TEs) derived from raw tabular data can lead to a severe failure mode known as model collapse, where predictions converge to a single class regardless of input. By comparing models trained with identical hyper-parameter configurations on both raw tabular data and their TE-derived counterparts, we find that collapse is a consistent failure mode in the latter setting. We introduce a set of metrics that capture the extent of model collapse, offering a new perspective on TE quality as a proxy for data curation. Our results reveal that TEs alone do not effectively function as a curation layer, and that their quality significantly influences downstream learning. More insidiously, we observe that the presence of model collapse can yield artificially inflated and spurious Accuracy-on-the-Line correlations. These findings highlight the need for more nuanced curation and evaluation of embedding-based representations, particularly in out-of-distribution settings.
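The abstract describes collapse as predictions converging to a single class but does not spell out its metrics here. As a rough illustration only, the sketch below computes two simple indicators from a list of predicted labels: the fraction of predictions falling in the most common class, and the entropy of the predicted-label distribution normalized by the maximum possible entropy. Both indicators and the function name `collapse_metrics` are assumptions made for this example, not the metrics introduced in the paper.

```python
import math
from collections import Counter


def collapse_metrics(predictions, num_classes):
    """Hypothetical collapse indicators (not the paper's metrics).

    Returns (majority_fraction, normalized_entropy):
      - majority_fraction: share of predictions in the most frequent class
        (1.0 means every input received the same label).
      - normalized_entropy: entropy of the predicted-label distribution
        divided by log(num_classes); 0.0 means fully collapsed,
        1.0 means predictions are spread uniformly over all classes.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")

    counts = Counter(predictions)
    n = len(predictions)

    majority_fraction = max(counts.values()) / n
    # Only classes that actually appear contribute to the entropy.
    entropy = sum(-(c / n) * math.log(c / n) for c in counts.values())
    normalized_entropy = entropy / math.log(num_classes)
    return majority_fraction, normalized_entropy


# A fully collapsed binary classifier versus a balanced one.
collapsed = collapse_metrics([1] * 100, num_classes=2)
balanced = collapse_metrics([0] * 50 + [1] * 50, num_classes=2)
```

Note that a collapsed classifier still scores an accuracy equal to the frequency of its single predicted class, so accuracy alone can hide the failure; indicators like these look at the prediction distribution instead.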
