The Surprising Effectiveness of Test-Time Training for Few-Shot Learning
This addresses the adaptability issue for language models in handling novel tasks, representing a strong specific gain rather than a broad paradigm shift.
The paper tackled the problem of language models struggling with structurally novel tasks in few-shot learning by applying test-time training, achieving up to 6x higher accuracy on ARC and a 7.3 percentage point improvement on BIG-Bench Hard.
Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of in-context task examples. We investigate the effectiveness of test-time training (TTT) -- temporarily updating model parameters during inference using a loss derived from input data -- as a mechanism for improving LMs' reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to $6\times$ higher accuracy compared to fine-tuned baselines -- reaching $53.0\%$ on the public validation set with an 8B-parameter LM and $61.9\%$ when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the $10$-shot setting by $7.3$ percentage points ($50.5\%$ to $57.8\%$). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.