CL · AI · IR · SE · Jan 29

When "Better" Prompts Hurt: Evaluation-Driven Iteration for LLM Applications

arXiv:2601.22025v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses the problem of reliable evaluation for LLM application developers, offering a practical framework to improve prompt iteration, though it is incremental as it builds on existing evaluation methods.

The paper tackles the challenge of evaluating LLM applications by proposing an evaluation-driven workflow and a Minimum Viable Evaluation Suite (MVES) that turn stochastic, high-dimensional outputs into a repeatable engineering loop. Its experiments show that a generic "improved" prompt can degrade performance: for Llama 3, extraction pass rate fell from 100% to 90% and RAG compliance from 93.3% to 80%.

Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow - Define, Test, Diagnose, Fix - that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic "improved" prompt template can trade off behaviors: on our small structured suites, extraction pass rate decreased from 100% to 90% and RAG compliance from 93.3% to 80% for Llama 3 when replacing task-specific prompts with generic rules, while instruction-following improved. These findings motivate evaluation-driven prompt iteration and careful claim calibration rather than universal prompt recipes. All test suites, harnesses, and results are included for reproducibility.
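The Test and Diagnose steps of the loop described above can be illustrated with a minimal sketch. It is not the paper's harness: `Case`, `has_json_key`, `is_sentence`, `fake_model`, `compare`, and `regressions` are hypothetical names, the two prompts are invented, and `fake_model` is a deterministic stub standing in for a real local model call. The sketch only shows how per-suite pass rates for a baseline and a candidate prompt can be compared so that a trade-off (one suite improving while another regresses) becomes visible.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    input_text: str
    check: Callable[[str], bool]  # automated check applied to the raw output


def has_json_key(key: str) -> Callable[[str], bool]:
    """Build a check that passes when the output is a JSON object containing `key`."""
    def check(output: str) -> bool:
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(parsed, dict) and key in parsed
    return check


def is_sentence(output: str) -> bool:
    """Toy instruction-following check: the output ends with a period."""
    return output.strip().endswith(".")


def fake_model(prompt: str, text: str) -> str:
    """Hypothetical stub standing in for a real model call (assumption)."""
    if "JSON only" in prompt:
        return json.dumps({"name": text})
    return f"The name is {text}."


def pass_rate(model, prompt: str, cases: list[Case]) -> float:
    passed = sum(1 for c in cases if c.check(model(prompt, c.input_text)))
    return passed / len(cases)


def compare(model, baseline: str, candidate: str, suites: dict[str, list[Case]]):
    """Return {suite name: (baseline pass rate, candidate pass rate)}."""
    return {
        name: (pass_rate(model, baseline, cases), pass_rate(model, candidate, cases))
        for name, cases in suites.items()
    }


def regressions(report, tolerance: float = 0.0) -> list[str]:
    """Suites where the candidate prompt scores below the baseline."""
    return [name for name, (base, cand) in report.items() if cand < base - tolerance]


SUITES = {
    "extraction": [Case("Ada", has_json_key("name")), Case("Grace", has_json_key("name"))],
    "instruction_following": [Case("Ada", is_sentence), Case("Grace", is_sentence)],
}

report = compare(
    fake_model,
    baseline="Extract the name. JSON only.",               # task-specific prompt (invented)
    candidate="Be helpful and answer in full sentences.",  # generic prompt (invented)
    suites=SUITES,
)
print(report)
print("regressed suites:", regressions(report))
```

With a real model the outputs are stochastic, so each case would typically be sampled several times and the tolerance set above zero; the stub here is deterministic only to keep the example reproducible.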

