CL · AI · LG · May 24, 2024

Filtered Corpus Training (FiCT) Shows that Language Models can Generalize from Indirect Evidence

UW
arXiv:2405.15750v2 · 31 citations · h-index: 9 · TACL
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of evaluating linguistic generalization in language models, a question of interest to AI and NLP researchers. The advance is incremental because it builds on existing training methods.

The paper tackles the problem of measuring how well language models generalize linguistically from indirect evidence. It introduces Filtered Corpus Training, which trains models on corpora from which specific constructions have been filtered out. It finds that LSTM and Transformer models perform equally well on generalization tasks, even though the Transformers reach lower perplexity.
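Perplexity, the measure on which the two architectures differ, can be illustrated with a minimal sketch. It assumes per-token natural-log probabilities are already available from some model; the function name `perplexity` and the toy inputs are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy input (assumption): a model that assigns probability 0.25 to each of 4 tokens.
toy_log_probs = [math.log(0.25)] * 4
toy_ppl = perplexity(toy_log_probs)  # close to 4: as uncertain as a uniform 4-way choice
```

Lower values mean the model assigns higher probability to the held-out text, which is the sense in which the Transformers are reported to be better language models.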

This paper introduces Filtered Corpus Training, a method that trains language models (LMs) on corpora with certain linguistic constructions filtered out from the training data, and uses it to measure the ability of LMs to perform linguistic generalization on the basis of indirect evidence. We apply the method to both LSTM and Transformer LMs (of roughly comparable size), developing filtered corpora that target a wide range of linguistic phenomena. Our results show that while transformers are better qua LMs (as measured by perplexity), both models perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect evidence.
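The filtering step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the regular expression for existential "there is/are" is a hypothetical stand-in for a construction detector, and the names `contains_target` and `filter_corpus` are not from the paper, whose actual filters may work quite differently.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Hypothetical construction detector (assumption): existential "there is/are/was/were".
EXISTENTIAL = re.compile(r"\bthere\s+(is|are|was|were)\b", re.IGNORECASE)


def contains_target(sentence: str) -> bool:
    """Return True if the sentence matches the hypothetical target construction."""
    return EXISTENTIAL.search(sentence) is not None


def filter_corpus(
    sentences: Iterable[str], is_target: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Split a corpus into sentences to keep for training and sentences removed."""
    kept: List[str] = []
    removed: List[str] = []
    for sentence in sentences:
        (removed if is_target(sentence) else kept).append(sentence)
    return kept, removed


corpus = [
    "There are three books on the table.",
    "The books are on the table.",
    "She said there was no time.",
    "The cat sleeps there.",
]
kept, removed = filter_corpus(corpus, contains_target)
```

In this sketch a model would be trained only on `kept` and then evaluated on the filtered-out construction, so any success has to come from indirect evidence rather than from direct exposure.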

Code Implementations: 1 repo