CL · AI · LG · Dec 8, 2023

The ICL Consistency Test

arXiv:2312.04945v1 · 7 citations · h-index: 27
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of inconsistent performance in prompt-based learning for LLMs and provides a benchmark for researchers, but it is incremental in that it builds on existing tasks and metrics.

The paper introduces the ICL consistency test to evaluate the consistency of large language models across different setups using the same data, revealing that all tested models lack robust generalization.

Just like the previous generation of task-tuned models, large language models (LLMs) that are adapted to tasks via prompt-based methods like in-context learning (ICL) perform well in some setups but not in others. This lack of consistency in prompt-based learning hints at a lack of robust generalisation. We here introduce the ICL consistency test -- a contribution to the GenBench collaborative benchmark task (CBT) -- which evaluates how consistently a model makes predictions across many different setups while using the same data. The test is based on different established natural language inference tasks. We provide preprocessed data constituting 96 different 'setups' and a metric that estimates model consistency across these setups. The metric is provided on a fine-grained level, to understand which properties of a setup render predictions unstable, and on an aggregated level, to compare overall model consistency. We conduct an empirical analysis of eight state-of-the-art models, and our consistency metric reveals that all tested LLMs lack robust generalisation.
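
To make the idea of a consistency metric across setups concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not state which statistic is used, so the sketch assumes chance-corrected agreement (Cohen's kappa) between the predictions of each pair of setups, averaged over all pairs. The function names, setup names, and labels are hypothetical.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def cohen_kappa(preds_a, preds_b):
    """Chance-corrected agreement between two prediction lists (assumed metric)."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and of equal length")
    n = len(preds_a)
    # Fraction of data points on which the two setups agree.
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    # Agreement expected by chance, from each setup's label frequencies.
    counts_a, counts_b = Counter(preds_a), Counter(preds_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both setups predict the same single label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_consistency(predictions_by_setup):
    """Aggregate score: average kappa over all pairs of setups."""
    pairs = combinations(predictions_by_setup.values(), 2)
    return mean(cohen_kappa(a, b) for a, b in pairs)


# Hypothetical predictions of one model on the same four NLI items
# under three different setups.
predictions = {
    "setup_a": ["entailment", "entailment", "neutral", "neutral"],
    "setup_b": ["entailment", "entailment", "neutral", "neutral"],
    "setup_c": ["entailment", "neutral", "entailment", "neutral"],
}
score = mean_pairwise_consistency(predictions)
```

A fine-grained view in the spirit of the abstract could be obtained by restricting the pairs to setups that differ in exactly one property and averaging per property; that grouping is likewise an assumption rather than something the abstract specifies.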

