ML LG | May 3, 2024

Position: Understanding LLMs Requires More Than Statistical Generalization

Patrik Reizinger, Szilvia Ujváry, Anna Mészáros, Anna Kerekes, Wieland Brendel, Ferenc Huszár

arXiv:2405.01964v3 | 22.824 citations | h-index: 36 | Has Code | ICML

Originality: Incremental advance

AI Analysis

This paper addresses a foundational problem for AI researchers by highlighting limitations in current theoretical frameworks for LLMs, though its contribution is incremental: it proposes a perspective shift rather than a new method.

The paper argues that understanding large language models (LLMs) requires more than statistical generalization, as models with similar test loss can exhibit different behaviors due to non-identifiability, supported by mathematical examples and case studies on zero-shot rule extrapolation, in-context learning, and fine-tunability.

The last decade has seen blossoming research in deep learning theory attempting to answer, "Why does deep learning generalize?" A powerful shift in perspective precipitated this progress: the study of overparametrized models in the interpolation regime. In this paper, we argue that another perspective shift is due, since some of the desirable qualities of LLMs are not a consequence of good statistical generalization and require a separate theoretical explanation. Our core argument relies on the observation that autoregressive (AR) probabilistic models are inherently non-identifiable: models that are zero or near-zero KL divergence apart (and thus have equivalent test loss) can exhibit markedly different behaviors. We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of fine-tunability. We review promising research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.
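The non-identifiability claim in the abstract can be illustrated with a toy sketch. This example is not taken from the paper: the two-token vocabulary, the data distribution, and all function names are assumptions made purely for illustration. Two autoregressive models agree with the data distribution on every sequence that has nonzero probability, so both have zero KL divergence from the data, yet they continue a prompt that never occurs in the data in opposite ways.

```python
import math
from itertools import product

VOCAB = ("a", "b")  # hypothetical two-token vocabulary


def p_data(seq):
    """Toy data distribution over length-2 sequences.

    The first token is always 'a'; the second token is uniform.
    """
    first, _second = seq
    return 0.5 if first == "a" else 0.0


def make_model(after_b):
    """Build an autoregressive model as a table of next-token conditionals."""
    return {
        "": {"a": 1.0, "b": 0.0},   # matches the data: always start with 'a'
        "a": {"a": 0.5, "b": 0.5},  # matches the data: uniform second token
        "b": after_b,               # prefix never seen in the data: unconstrained
    }


def seq_prob(model, seq):
    """Probability of a sequence under the chain rule."""
    prob, prefix = 1.0, ""
    for tok in seq:
        prob *= model[prefix][tok]
        prefix += tok
    return prob


def kl_from_data(model):
    """KL(p_data || model), summed over sequences in the support of p_data."""
    total = 0.0
    for seq in product(VOCAB, repeat=2):
        p = p_data(seq)
        if p > 0.0:
            total += p * math.log(p / seq_prob(model, seq))
    return total


def continue_prompt(model, prefix):
    """Most likely next token after a given prefix."""
    return max(model[prefix], key=model[prefix].get)


# The two models differ only on the out-of-distribution prefix 'b'.
model_a = make_model({"a": 1.0, "b": 0.0})
model_b = make_model({"a": 0.0, "b": 1.0})

print(kl_from_data(model_a), kl_from_data(model_b))
print(continue_prompt(model_a, "b"), continue_prompt(model_b, "b"))
```

Because the prefix 'b' has zero probability under the toy data distribution, no test-loss measurement on that distribution can distinguish the two models, even though their behavior on that prompt differs completely.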

