CL | Dec 21, 2022

Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models

arXiv:2212.10769v1 | 34 citations | h-index: 53
Originality: Incremental advance
AI Analysis

This paper reveals a critical flaw in benchmarks for compositional generalization: results previously reported for pretrained models may be inflated, which matters to NLP researchers and practitioners who rely on those benchmarks.

The study addresses the overestimation of compositional generalization in pretrained models such as T5, which arises from uncontrolled lexical exposure during pretraining. Modified evaluations that replace the context-controlled lexical items with novel character sequences or with novel embeddings yield lower performance, and the degradation grows with the amount of pretraining data.

Human linguistic capacity is often characterized by compositionality and the generalization it enables -- human learners can produce and comprehend novel complex expressions by composing known parts. Several benchmarks exploit distributional control across training and test to gauge compositional generalization, where certain lexical items only occur in limited contexts during training. While recent work using these benchmarks suggests that pretrained models achieve impressive generalization performance, we argue that exposure to pretraining data may break the aforementioned distributional control. Using the COGS benchmark of Kim and Linzen (2020), we test two modified evaluation setups that control for this issue: (1) substituting context-controlled lexical items with novel character sequences, and (2) substituting them with special tokens represented by novel embeddings. We find that both of these setups lead to lower generalization performance in T5 (Raffel et al., 2020), suggesting that previously reported results have been overestimated due to uncontrolled lexical exposure during pretraining. The performance degradation is more extreme with novel embeddings, and the degradation increases with the amount of pretraining data, highlighting an interesting case of inverse scaling.
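
Setup (1) in the abstract, substituting context-controlled lexical items with novel character sequences, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the consonant-vowel scheme for generating nonce strings, and the example sentence and logical form are all assumptions made for illustration, and the real COGS data format may differ. The sketch only shows the general idea of replacing a controlled word consistently in both the input and the target, so that the model cannot have seen the string during pretraining in the held-out context.

```python
import random
import re


def make_novel_word(rng, length=5):
    """Build a nonce string by alternating consonants and vowels (assumed scheme)."""
    consonants = "bdfgklmnprstvz"
    vowels = "aeiou"
    return "".join(
        rng.choice(consonants if i % 2 == 0 else vowels) for i in range(length)
    )


def build_substitutions(controlled_items, vocabulary, seed=0):
    """Map each context-controlled item to a distinct nonce string.

    Nonce strings that collide with the known vocabulary, or with an
    earlier nonce string, are rejected and regenerated.
    """
    rng = random.Random(seed)
    used = {w.lower() for w in vocabulary}
    mapping = {}
    for item in controlled_items:
        novel = make_novel_word(rng)
        while novel in used:
            novel = make_novel_word(rng)
        used.add(novel)
        mapping[item] = novel
    return mapping


def substitute(text, mapping):
    """Replace whole-word, case-sensitive occurrences of the mapped items."""
    if not mapping:
        return text
    # Longest items first so that a longer item wins over a shared prefix.
    alternatives = "|".join(
        re.escape(w) for w in sorted(mapping, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


# Illustrative example pair; not taken verbatim from the COGS data.
sentence = "The hedgehog ate the cake ."
logical_form = "hedgehog ( x _ 1 ) ; eat . agent ( x _ 2 , x _ 1 )"

mapping = build_substitutions(["hedgehog"], vocabulary=["the", "ate", "cake"])
print(substitute(sentence, mapping))
print(substitute(logical_form, mapping))
```

The same mapping is applied to the sentence and to its logical form so that the pair stays consistent. Setup (2), which uses special tokens with novel embeddings, would additionally require extending the model's vocabulary and is not covered by this sketch.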

Code Implementations: 1 repo
