CL LG · Sep 9, 2025

Instance-level Performance Prediction for Long-form Generation Tasks

Chi-Yang Hsu, Alexander Braylan, Yiheng Su, Omar Alonso, Matthew Lease

arXiv:2509.07309v1 · 4.91 citations · h-index: 7

Originality: Incremental advance

AI Analysis

The paper provides a benchmark for researchers working on performance prediction in natural language generation. The advance is incremental, since it builds on existing evaluation methods.

The paper tackles the problem of predicting instance-level performance scores for long-form generation tasks with multi-faceted metrics, showing that scores can be effectively predicted across 11 datasets using as few as 16 training examples.

We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multi-faceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.
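To make the task formulation concrete, here is a minimal sketch of a predictor that returns both a point estimate and a prediction interval from 16 labeled examples. It is not one of the paper's baselines. It assumes each black-box input/output pair has already been reduced to a single hypothetical scalar feature (for example, output length), fits a least-squares line on half of the examples, and uses split conformal prediction on the other half to size the interval. All function names are illustrative.

```python
import math
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x on one scalar feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    # Fall back to a constant predictor if the feature has no variance.
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def conformal_radius(residuals, alpha):
    """Split-conformal quantile of the absolute calibration residuals."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(residuals)[k - 1]


def fit_predictor(features, scores, alpha=0.2):
    """Fit on the first half, calibrate the interval width on the second half.

    `features` is a hypothetical scalar summary of each (input, output) pair;
    `scores` are the continuous metric scores to be predicted.
    """
    half = len(features) // 2
    a, b = fit_line(features[:half], scores[:half])
    residuals = [
        abs(y - (a + b * x))
        for x, y in zip(features[half:], scores[half:])
    ]
    radius = conformal_radius(residuals, alpha)

    def predict(x):
        point = a + b * x
        return point, (point - radius, point + radius)

    return predict
```

Calling `fit_predictor(features, scores)` with 16 examples yields a `predict` function; `predict(x)` returns the point estimate and a `(lower, upper)` interval targeting roughly `1 - alpha` coverage under the usual exchangeability assumption. With only 8 calibration points, `alpha` cannot go below about 0.12 before the interval becomes unbounded.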
