CL · Feb 17

*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive

arXiv:2602.15778v1 · 1.11 citations · h-index: 9

Originality: Incremental advance

AI Analysis

This work targets the efficiency of text evaluation metrics used by NLP researchers. It appears incremental because it builds on an existing method, ParaPLUIE.

The paper tackles the computational expense and post-processing needs of LLM-as-a-judge methods for text evaluation by introducing *-PLUIE, a task-specific prompting variant of ParaPLUIE, which achieves stronger correlations with human ratings while maintaining low computational cost.

Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
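
To make the "confidence over Yes/No answers without generating text" idea concrete, here is a minimal sketch. It is not the authors' formula or code: it assumes the score is the log-odds of "Yes" over "No" as the next token after a judging prompt. The `next_token_logprobs` interface and the `toy_model` stand-in are hypothetical, and the probabilities are made up for illustration.

```python
import math
from typing import Callable, Dict

# Hypothetical interface: maps a prompt to log-probabilities of candidate
# next tokens. A real setup would read these from one LLM forward pass.
NextTokenLogProbs = Callable[[str], Dict[str, float]]


def yes_no_score(prompt: str, next_token_logprobs: NextTokenLogProbs) -> float:
    """Log-odds of 'Yes' over 'No' as the next token; positive favours 'Yes'."""
    logprobs = next_token_logprobs(prompt)
    return logprobs["Yes"] - logprobs["No"]


def yes_probability(score: float) -> float:
    """Map the log-odds score to a probability of 'Yes' restricted to {Yes, No}."""
    return 1.0 / (1.0 + math.exp(-score))


def toy_model(prompt: str) -> Dict[str, float]:
    # Stand-in for a real model; the numbers are invented for this example.
    return {"Yes": math.log(0.8), "No": math.log(0.2)}


# Assumed example prompt; a task-specific variant would change this wording.
prompt = "Do these two sentences have the same meaning? Answer Yes or No."
score = yes_no_score(prompt, toy_model)
confidence = yes_probability(score)
```

Because only the next-token distribution is read, no answer text is decoded and nothing has to be parsed afterwards, which is the source of the low cost and the absence of post-processing described in the abstract. Under this reading, a task-specific variant changes only the prompt.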
