CL | Nov 8, 2024

Assessing the Answerability of Queries in Retrieval-Augmented Code Generation

Geonmin Kim, Jaeyeon Kim, Hancheol Park, Wooksu Shin, Tae-Ho Kim

arXiv:2411.05547v21.02 citations | h-index: 5

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of plausible yet incorrect code being generated for software developers, but its contribution is incremental: it focuses on evaluation rather than on a new generation method.

The study tackles incorrect output in retrieval-augmented code generation by proposing a task that evaluates answerability from the user's query and the retrieved APIs. It also builds a benchmark dataset, RaCGEval, on which baseline models reach only 46.7% performance, indicating that the task is very challenging.

Thanks to the unprecedented language understanding and generation capabilities of large language models (LLMs), Retrieval-augmented Code Generation (RaCG) has recently become widely used among software developers. While this has increased productivity, incorrect code is still frequently produced. In particular, plausible yet incorrect code is sometimes generated for user queries that cannot be answered with the given queries and API descriptions. This study proposes a task for evaluating answerability, which assesses whether a valid answer can be generated from the user's query and the retrieved APIs in RaCG. Additionally, we build a benchmark dataset called Retrieval-augmented Code Generability Evaluation (RaCGEval) to evaluate the performance of models on this task. Experimental results show that the task remains very challenging, with baseline models reaching a low performance of 46.7%. Furthermore, this study discusses methods that could significantly improve performance.
