When "Better" Prompts Hurt: Evaluation-Driven Iteration for LLM Applications

arXiv:2601.22025v1 (1 citation)

Originality: Incremental advance

AI Analysis

This work addresses reliable evaluation for LLM application developers and offers a practical framework for prompt iteration. The advance is incremental because it builds on existing evaluation methods.

The paper proposes an evaluation-driven workflow and a Minimum Viable Evaluation Suite (MVES) that turn stochastic, high-dimensional LLM outputs into a repeatable engineering loop. Its experiments show that a generic 'improved' prompt can degrade performance: for Llama 3, extraction pass rate fell from 100% to 90% and RAG compliance from 93.3% to 80%.

Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow - Define, Test, Diagnose, Fix - that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic "improved" prompt template can trade off behaviors: on our small structured suites, extraction pass rate decreased from 100% to 90% and RAG compliance from 93.3% to 80% for Llama 3 when replacing task-specific prompts with generic rules, while instruction-following improved. These findings motivate evaluation-driven prompt iteration and careful claim calibration rather than universal prompt recipes. All test suites, harnesses, and results are included for reproducibility.
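The comparison described in the abstract, in which a task-specific prompt and a generic prompt are run on the same suite and their pass rates compared, can be sketched as a small regression check. This is a minimal illustration, not the paper's harness: the function names (`has_json_keys`, `pass_rate`, `compare_prompts`), the required keys, and the stubbed model outputs are all hypothetical, and no model is actually called. The stub data is chosen only so that the arithmetic mirrors a drop from 10/10 to 9/10 passing cases.

```python
import json
from typing import Callable

# A check maps one raw model output to pass (True) or fail (False).
Check = Callable[[str], bool]


def has_json_keys(required: set) -> Check:
    """Build an automated check: output must be a JSON object with the given keys."""
    def check(output: str) -> bool:
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(parsed, dict) and required.issubset(parsed.keys())
    return check


def pass_rate(outputs: list, check: Check) -> float:
    """Fraction of outputs that satisfy the check."""
    if not outputs:
        raise ValueError("evaluation suite is empty")
    return sum(1 for o in outputs if check(o)) / len(outputs)


def compare_prompts(baseline: list, candidate: list, check: Check) -> dict:
    """Compare two prompt variants on the same suite and flag a regression."""
    base = pass_rate(baseline, check)
    cand = pass_rate(candidate, check)
    return {"baseline": base, "candidate": cand, "regressed": cand < base}


# Hypothetical stubbed outputs standing in for real model responses.
valid = '{"name": "Ada", "date": "2024-01-01"}'
wrapped = 'Sure! Here is the JSON: {"name": "Ada", "date": "2024-01-01"}'

baseline_outputs = [valid] * 10             # task-specific prompt: all parse
candidate_outputs = [valid] * 9 + [wrapped]  # generic prompt: one adds prose

result = compare_prompts(
    baseline_outputs, candidate_outputs, has_json_keys({"name", "date"})
)
print(result)
```

In a real loop the stubbed lists would be replaced by outputs collected from the model under each prompt, and a regression flag on any suite would block the prompt change until the failing cases are diagnosed.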
