CL LG PL · Aug 5, 2025

More Than a Score: Probing the Impact of Prompt Specificity on LLM Code Generation

Yangtian Zi, Harshitha Menon, Arjun Guha

arXiv:2508.03678v1 · 3 citations · h-index: 7 · IJCNLP-AACL

Originality: Incremental advance

AI Analysis

This work addresses the problem of optimizing LLM prompts for better code generation, particularly in specialized domains, though it is incremental in nature.

The study investigates how prompt specificity affects LLM code generation performance on the HumanEval and ParEval benchmarks. It finds that sensitivity to prompt detail varies across tasks, and that explicit I/O specifications, edge-case handling, and stepwise breakdowns are the main drivers of pass@1 improvement.

State-of-the-art Large Language Models (LLMs) achieve high pass@1 on general benchmarks like HumanEval but underperform on specialized suites such as ParEval. Is this because LLMs lack domain knowledge, or because the prompts give insufficient detail? To answer this, we introduce PartialOrderEval, which augments any code generation benchmark with a partial order of prompts from minimal to maximally detailed. Applying it to HumanEval and to both the serial and OpenMP subsets of ParEval, we measure how pass@1 scales with prompt specificity. Our experiments with Llama-3.x and Qwen2.5-Coder demonstrate varying degrees of prompt sensitivity across different tasks, and a qualitative analysis highlights explicit I/O specifications, edge-case handling, and stepwise breakdowns as the key drivers of prompt detail improvement.
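As a rough illustration of what "measuring how pass@1 scales with prompt specificity" could look like, the sketch below averages pass@1 over problems at each prompt level. It assumes the commonly used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k); the abstract does not say how pass@1 is computed. The level names and sample counts are hypothetical, and the partial order is simplified here to a linear chain of three levels.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c pass the tests."""
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical per-problem (n_samples, n_correct) counts at three
# specificity levels, ordered from least to most detailed prompt.
by_level = {
    "minimal": [(10, 2), (10, 0)],
    "moderate": [(10, 5), (10, 3)],
    "maximal": [(10, 9), (10, 7)],
}

# One pass@1 value per specificity level.
curve = {level: mean_pass_at_1(r) for level, r in by_level.items()}

for level, score in curve.items():
    print(f"{level:>8}: pass@1 = {score:.2f}")
```

With k = 1 the estimator reduces to c/n per problem, so each point on the curve is simply the mean fraction of passing samples at that level of prompt detail.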
