Revisit Self-Debugging with Self-Generated Tests for Code Generation
The work addresses a practical limitation for developers who use LLMs for code generation, namely the scarcity of high-quality tests in real-world settings, but the contribution is incremental because it builds on existing self-debugging concepts.
The paper tackles limited test availability in self-debugging for code generation by exploring self-generated tests. It finds that post-execution self-debugging struggles on basic problems but improves on competitive ones, while in-execution self-debugging improves code generation by leveraging intermediate states during execution.
Large language models (LLMs) have shown significant advancements in code generation, but still face challenges on tasks beyond their basic capabilities. Recently, the notion of self-debugging has been proposed to boost the performance of code generation by leveraging execution feedback from tests. Despite its promise, the availability of high-quality tests in real-world scenarios is limited. In this context, self-debugging with self-generated tests is a promising solution but lacks a full exploration of its limitations and practical potential. Therefore, we investigate its efficacy on diverse programming problems. To deepen our understanding, we propose two distinct paradigms for the process: post-execution and in-execution self-debugging. Within the scope of self-contained Python programming tasks, we find that post-execution self-debugging struggles on basic problems but shows potential for improvement on competitive ones, due to the bias introduced by self-generated tests. On the other hand, in-execution self-debugging enables LLMs to mitigate the bias by solely leveraging intermediate states during execution, thereby enhancing code generation.
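To make the two paradigms concrete, the following is a minimal Python sketch, not the paper's implementation. The helper names `run_post_execution` and `trace_in_execution`, the `running_sum` candidate, and the two tests are hypothetical and chosen only for illustration. The first helper reports the pass/fail outcome of self-generated asserts after execution. The second uses `sys.settrace` to record local variables line by line, standing in for the intermediate states that in-execution self-debugging inspects. The second test deliberately carries a wrong expected value to show how a faulty self-generated test can flag correct code.

```python
import sys

# Hypothetical candidate program, as an LLM might produce it (it is correct).
CANDIDATE = """def running_sum(xs):
    total = 0
    out = []
    for x in xs:
        total += x
        out.append(total)
    return out
"""

# Hypothetical self-generated tests; the second has a wrong expected output.
TESTS = [
    "assert running_sum([1, 2, 3]) == [1, 3, 6]",
    "assert running_sum([1, 2, 3]) == [1, 3, 5]",
]


def run_post_execution(code: str, tests: list[str]) -> list[str]:
    """Post-execution feedback: run each test and collect failure messages."""
    failures = []
    for test in tests:
        env = {}
        try:
            exec(code, env)   # define the candidate function
            exec(test, env)   # run one self-generated assert against it
        except AssertionError:
            failures.append(f"failed: {test}")
        except Exception as exc:
            failures.append(f"error in {test}: {type(exc).__name__}: {exc}")
    return failures


def trace_in_execution(code: str, call: str) -> list[tuple[int, dict]]:
    """In-execution feedback: record (line number, locals) before each line runs."""
    env = {}
    exec(compile(code, "<candidate>", "exec"), env)
    states = []

    def tracer(frame, event, arg):
        # Only follow frames that belong to the candidate program.
        if frame.f_code.co_filename != "<candidate>":
            return None
        if event == "line":
            # Store reprs so later mutation does not alter earlier snapshots.
            snapshot = {k: repr(v) for k, v in frame.f_locals.items()}
            states.append((frame.f_lineno, snapshot))
        return tracer

    sys.settrace(tracer)
    try:
        eval(call, env)
    finally:
        sys.settrace(None)
    return states


failures = run_post_execution(CANDIDATE, TESTS)
print(failures)  # only the test with the wrong expected value is reported

states = trace_in_execution(CANDIDATE, "running_sum([1, 2])")
for lineno, local_vars in states:
    print(lineno, local_vars)
```

In this sketch the candidate is correct, yet the post-execution signal still reports a failure because the second test is wrong, which is the kind of bias the abstract attributes to self-generated tests. The trace relies on no expected outputs at all: a model reading it would judge the program from the evolving values of `total` and `out` alone.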