Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering
This work addresses unreliable model performance in software engineering tasks, a concern for both developers and researchers; its contribution is incremental in that it extends existing benchmarks.
The study investigates the robustness of large language models in long-context code question answering. It finds substantial performance drops under varied input conditions, such as shuffled answer options and irrelevant information, and the results highlight limitations in current evaluations.
Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context code question answering using controlled ablations that test sensitivity to answer format, distractors, and context scale. Extending the LongCodeBench Python dataset with new COBOL and Java question-answer sets, we evaluate state-of-the-art models under three settings: (i) shuffled multiple-choice options, (ii) open-ended questions, and (iii) needle-in-a-haystack contexts containing relevant and adversarially irrelevant information. Results show substantial performance drops under both shuffled multiple-choice options and open-ended questions, and brittle behavior in the presence of irrelevant cues. Our findings highlight limitations of current long-context evaluations and provide a broader benchmark for assessing code reasoning in both legacy and modern systems.
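
The shuffled multiple-choice setting (i) can be illustrated with a minimal sketch. It is not the benchmark's actual harness: the `MCQuestion` type, the `shuffle_options` and `shuffled_accuracy` helpers, and the `predict` callback are hypothetical names introduced here, and the example assumes each question stores its options together with the index of the correct one.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class MCQuestion:
    """Hypothetical container for one multiple-choice item."""
    prompt: str
    options: Tuple[str, ...]
    answer_index: int  # position of the correct option in `options`


def shuffle_options(q: MCQuestion, seed: int) -> MCQuestion:
    """Return a copy of `q` with options permuted and the answer index remapped."""
    rng = random.Random(seed)  # seeded so each variant is reproducible
    order = list(range(len(q.options)))
    rng.shuffle(order)
    new_options = tuple(q.options[i] for i in order)
    # order[k] is the old index now sitting at position k, so the correct
    # option's new position is wherever the old answer index ended up.
    new_answer = order.index(q.answer_index)
    return MCQuestion(q.prompt, new_options, new_answer)


def shuffled_accuracy(
    q: MCQuestion,
    predict: Callable[[MCQuestion], int],  # assumed: returns the chosen option index
    seeds: Iterable[int],
) -> float:
    """Fraction of shuffled variants of `q` that `predict` answers correctly."""
    variants = [shuffle_options(q, s) for s in seeds]
    correct = sum(predict(v) == v.answer_index for v in variants)
    return correct / len(variants)
```

Under this sketch, a model that keys on option position rather than content would score well on the original ordering but inconsistently across seeds, which is the kind of sensitivity the shuffling ablation is meant to expose.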