CL · Jun 29, 2025

Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions

Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng

arXiv:2506.23146v34.91 citations · h-index: 9

Originality: Incremental advance

AI Analysis

The work addresses the difficulty practitioners face in determining when ICL reliably improves model performance. It is an incremental advance because it builds on existing ICL evaluation methods.

The paper tackles the unreliable evaluation of in-context learning (ICL) effectiveness in large language models. It proposes the Learning-to-Context Slope (LCS) metric, which quantifies ICL effectiveness from loss changes and contextual relevance. According to the authors, LCS correlates strongly with performance improvements and reliably reflects effectiveness in data-scarce scenarios.

In-context learning (ICL) has emerged as an effective approach to enhance the performance of large language models (LLMs). However, its effectiveness varies significantly across models and tasks, posing challenges for practitioners to determine when ICL reliably improves performance. Current evaluation approaches, reliant on performance change after applying ICL, suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance). LCS addresses key limitations of performance-based metrics: (1) it captures continuous loss changes even when outputs are incorrect, improving reliability; (2) its formulation attributes ICL failures to weak contextual alignment (inability to adapt inputs to demonstrations) or strong output calibration (self-verification of correctness); and (3) it minimizes reliance on labeled data via synthetic evaluation. Extensive experiments demonstrate that LCS strongly correlates with performance improvements in labeled settings and reliably reflects true effectiveness in biased or data-scarce scenarios. Further analysis reveals actionable thresholds for LCS and identifies model capabilities critical to ICL success.
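The abstract describes LCS as a slope between learning gain (the loss decrease from demonstrations) and contextual relevance. The following is a minimal sketch of that idea only, assuming per-example losses and relevance scores are already available and that the slope is estimated by ordinary least squares. The function names, the least-squares fit, and the toy numbers are illustrative assumptions, not the paper's actual formulation.

```python
from typing import Sequence


def learning_gain(loss_without_demos: Sequence[float],
                  loss_with_demos: Sequence[float]) -> list[float]:
    """Per-example loss decrease after adding demonstrations (assumed definition)."""
    if len(loss_without_demos) != len(loss_with_demos):
        raise ValueError("loss sequences must have the same length")
    return [a - b for a, b in zip(loss_without_demos, loss_with_demos)]


def least_squares_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Slope of the ordinary least-squares line fitted to (x, y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("relevance scores must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def lcs_sketch(relevance: Sequence[float],
               loss_without_demos: Sequence[float],
               loss_with_demos: Sequence[float]) -> float:
    """Hypothetical LCS-style score: slope of learning gain against relevance."""
    gains = learning_gain(loss_without_demos, loss_with_demos)
    return least_squares_slope(relevance, gains)


# Toy, made-up values purely for illustration.
relevance = [0.0, 0.5, 1.0]
loss_without = [2.0, 2.0, 2.0]
loss_with = [2.0, 1.5, 1.0]
score = lcs_sketch(relevance, loss_without, loss_with)
```

Under this reading, a larger positive slope would mean that more relevant demonstrations yield larger loss decreases, while a slope near zero would suggest the model gains little from relevant context. Because the score is computed from losses rather than exact-match correctness, it can change even when the final output remains wrong, which is the reliability argument the abstract makes.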
