FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation
This work addresses task-agnostic performance estimation for language models. The contribution is incremental: it refines existing familiarity-based prompt metrics by adding task complexity as a second factor.
The paper tackles the problem that existing prompt metrics such as perplexity fail to accurately estimate language model performance in complex scenarios such as task or domain transfer. It proposes FamiCom, which combines familiarity with task complexity, achieving a Spearman's correlation of 0.85 with end-task performance and an accuracy improvement of more than 7.0% in prompt selection.
Language models have shown impressive in-context learning capabilities, which allow them to benefit from input prompts and perform better on downstream end tasks. Existing works investigate the mechanisms behind this observation and propose label-agnostic prompt metrics that can better estimate end-task performance. One popular approach uses perplexity to measure a model's familiarity with the prompt. While such familiarity metrics show consistent improvements on in-domain tasks, we find that they cannot accurately estimate performance in complicated situations such as task or domain transfer. In this work, we propose a revised measure called FamiCom, which provides a more comprehensive measure for task-agnostic performance estimation. Specifically, FamiCom combines familiarity with \textit{complexity}, the inherent difficulty of end tasks, which is an important factor missing from current metrics. Experiments show that FamiCom strongly correlates with end-task performance, producing a Spearman's correlation of 0.85, versus 0.43 for familiarity-only metrics. We further apply FamiCom to automatic prompt and demonstration selection and outperform existing methods and baselines by more than 7.0% in accuracy.
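To make the two quantities named in the abstract concrete, the following is a minimal Python sketch of (a) perplexity computed from per-token log-probabilities, used here as a familiarity signal, and (b) Spearman's rank correlation between a prompt metric and end-task accuracy. It is an illustration under stated assumptions, not the paper's implementation: the function names (`perplexity`, `average_ranks`, `spearman`) and all numbers in the demo are hypothetical, and the abstract does not specify how FamiCom combines familiarity with complexity, so that step is deliberately left out.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed inputs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Made-up toy values for three hypothetical prompts (NOT results from the paper).
toy_logprobs = [[-0.2, -0.4, -0.3], [-1.1, -0.9, -1.3], [-2.0, -2.4, -1.9]]
toy_accuracy = [0.81, 0.64, 0.70]

# Lower perplexity means higher familiarity, so negate it to get a score
# where larger is "more familiar".
familiarity = [-perplexity(lp) for lp in toy_logprobs]
rho = spearman(familiarity, toy_accuracy)
print(f"Spearman correlation (toy data): {rho:.2f}")
```

In practice the per-token log-probabilities would come from the language model being evaluated, and a library routine such as `scipy.stats.spearmanr` could replace the hand-written rank correlation; the pure-Python version is used here only to keep the sketch dependency-free.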