LG · May 26

Learn from your own latents and not from tokens: A sample-complexity theory

Daniel J. Korchinski, Alessandro Favero, Matthieu Wyart

Cambridge

arXiv:2605.27734 · 91.2 · h-index: 53

Predicted impact top 12% in LG · last 90 days · Originality Highly original

AI Analysis

Provides theoretical justification for latent prediction methods like data2vec and JEPA, showing they can drastically improve data efficiency for hierarchical data, which is relevant for generative models and understanding biological learning.

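The paper proves that latent prediction (a network predicting its own latent representations) recovers the latent tree with a number of samples constant in the depth $L$ of the hierarchy, up to logarithmic factors, whereas supervised learning or token-level SSL requires a number of samples exponential in $L$. The result is established on a probabilistic context-free grammar and confirmed with a hierarchical clustering algorithm, an end-to-end neural network, and a sample-complexity analysis of data2vec.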

Generative models, from diffusion models to large language models, achieve remarkable performance but at a cost in training data orders of magnitude larger than what biological learners require. An alternative paradigm has emerged in which networks are trained to predict their own latent representations of related views or masked regions, as in data2vec and JEPA -- an idea related to predictive-coding accounts of the cortex. Despite strong empirical results, the theoretical understanding of these methods remains limited. Central questions include: by how much does latent prediction actually improve data efficiency? Is there a benefit to stacking such methods into multi-scale hierarchies? We answer both using as data a tractable probabilistic context-free grammar that captures the compositional structure of natural language and images. Such a grammar generates strings of visible tokens by recursively applying production rules along a tree of hidden symbols of depth $L$. For such data, supervised learning or token-level SSL requires a number of samples exponential in $L$ to recover the latent tree; we prove that latent prediction achieves this with a number of samples constant in $L$, up to logarithmic factors. We confirm this bound with (i) a hierarchical clustering algorithm, (ii) an end-to-end neural network whose predictor-clusterer modules predict their own latents at each level via gradient descent, and (iii) the first sample-complexity analysis of data2vec, which we show implicitly performs hierarchical latent prediction. This suggests that explicit stacking such as H-JEPA is largely redundant.
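To make the data model concrete, here is a minimal Python sketch of how a grammar of this kind might generate a string by expanding hidden symbols level by level down a tree of depth $L$. It is an illustration only, not the authors' construction: the names `make_grammar` and `sample_string`, the shared vocabulary size, the fixed branching factor, and the uniformly random choice of rules are all assumptions made for the example, and the toy grammar does not enforce any property (such as unambiguous rules) that the paper's analysis may require.

```python
import random

def make_grammar(depth, vocab_size=4, rules_per_symbol=2, branching=2, seed=0):
    """Build one random rule table per level (hypothetical toy grammar).

    rules[level][symbol] is a list of `rules_per_symbol` tuples, each tuple
    holding `branching` child symbols one level further down the tree.
    """
    rng = random.Random(seed)
    rules = []
    for _ in range(depth):
        level_rules = {
            symbol: [
                tuple(rng.randrange(vocab_size) for _ in range(branching))
                for _ in range(rules_per_symbol)
            ]
            for symbol in range(vocab_size)
        }
        rules.append(level_rules)
    return rules

def sample_string(rules, root, rng):
    """Expand `root` one level at a time; return the visible tokens (leaves)."""
    symbols = [root]
    for level_rules in rules:
        expanded = []
        for symbol in symbols:
            # Pick one production rule for this hidden symbol at random.
            expanded.extend(rng.choice(level_rules[symbol]))
        symbols = expanded
    return symbols

depth = 3  # plays the role of L in the abstract
rules = make_grammar(depth)
rng = random.Random(1)
tokens = sample_string(rules, root=0, rng=rng)
print(len(tokens), tokens)
```

With a branching factor of 2 and depth 3, each sampled string has 2^3 = 8 visible tokens, while the symbols at the intermediate levels stay hidden. Those hidden symbols are the latent tree that a learner has to recover from the tokens alone.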
