LG · SE · Jun 18, 2024

What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering

arXiv:2406.12334v4
113 citations
Originality: Incremental advance
AI Analysis

This paper addresses a challenge faced by developers who integrate LLMs into software: it provides metrics to guide prompt engineering. The advance is incremental because it builds on existing evaluation methods.

The authors tackle the problem of debugging LLMs' inconsistent behavior across minor prompt variations. They introduce sensitivity and consistency metrics for classification tasks and compare them empirically on text classification tasks to understand failure modes.

Large Language Models (LLMs) have changed the way we design and interact with software systems. Their ability to process and extract information from text has drastically improved productivity in a number of routine tasks. Developers who want to include these models in their software stack, however, face a dreadful challenge: debugging LLMs' inconsistent behavior across minor variations of the prompt. We therefore introduce two metrics for classification tasks, namely sensitivity and consistency, which are complementary to task performance. Sensitivity measures changes of predictions across rephrasings of the prompt, and does not require access to ground truth labels. Consistency, instead, measures how predictions vary across rephrasings for elements of the same class. We perform an empirical comparison of these metrics on text classification tasks, using them as a guideline for understanding failure modes of the LLM. Our hope is that sensitivity and consistency will help guide prompt engineering and obtain LLMs that balance robustness with performance.
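To make the two definitions concrete, here is a minimal sketch of how such metrics could be computed from a table of predictions (one row per input, one column per prompt rephrasing). The formulas are assumptions chosen for illustration, not necessarily the ones used in the paper: sensitivity is taken as the mean normalized entropy of each input's predictions across rephrasings, and consistency as one minus the mean pairwise total variation distance between the prediction distributions of inputs that share a ground-truth class. The function names and data layout are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def prediction_distribution(preds, classes):
    """Empirical distribution of one input's predictions across rephrasings."""
    counts = Counter(preds)
    return [counts[c] / len(preds) for c in classes]


def sensitivity(preds_per_sample, classes):
    """Mean normalized entropy across rephrasings (assumed formula).

    Needs no ground-truth labels: 0 means every rephrasing yields the same
    prediction, 1 means predictions are uniform over the classes.
    Requires at least two classes.
    """
    total = 0.0
    for preds in preds_per_sample:
        dist = prediction_distribution(preds, classes)
        entropy = -sum(p * log(p) for p in dist if p > 0)
        total += entropy / log(len(classes))
    return total / len(preds_per_sample)


def consistency(preds_per_sample, labels, classes):
    """Per-class consistency (assumed formula).

    For each class, 1 minus the mean pairwise total variation distance
    between the prediction distributions of inputs with that true label.
    Classes with fewer than two inputs are skipped.
    """
    result = {}
    for c in classes:
        dists = [
            prediction_distribution(preds, classes)
            for preds, y in zip(preds_per_sample, labels)
            if y == c
        ]
        pairs = list(combinations(dists, 2))
        if not pairs:
            continue
        mean_tvd = sum(
            0.5 * sum(abs(a - b) for a, b in zip(d1, d2)) for d1, d2 in pairs
        ) / len(pairs)
        result[c] = 1.0 - mean_tvd
    return result


# Toy data: three inputs, four prompt rephrasings each.
classes = ["pos", "neg"]
preds = [
    ["pos", "pos", "pos", "pos"],  # stable under rephrasing
    ["pos", "neg", "pos", "neg"],  # flips with the prompt
    ["neg", "neg", "neg", "neg"],  # stable under rephrasing
]
labels = ["pos", "pos", "neg"]

toy_sensitivity = sensitivity(preds, classes)
toy_consistency = consistency(preds, labels, classes)
```

In this toy example only the second input changes its prediction across rephrasings, so it alone contributes to sensitivity, and it also lowers the consistency of the "pos" class because its prediction distribution differs from that of the first input.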

Code Implementations: 1 repo
