SE · CL · PL · Oct 26, 2022

Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic?

arXiv:2210.14699v3 · 36 citations · h-index: 32
Originality: Incremental advance
AI Analysis

This work shows that the performance of LLM-based code assistants is highly sensitive to input variations, which bears directly on how software engineers should use them. The contribution is incremental because it builds on existing models and benchmarks.

The study investigates how varying input parameters (task description, temperature, and number of generated solutions) affects the performance of LLM-based code assistants (Copilot, Codex, StarCoder2) on algorithmic benchmarks. These variations can significantly improve success rates: up to 79.27% in one-shot generation, compared with 22.44% for Codex and 31.1% for Copilot in default settings.

Language models are promising solutions for tackling increasingly complex problems. In software engineering, they have recently gained attention in code assistants, which generate programs from a natural language task description (prompt). They have the potential to save time and effort but remain poorly understood, which limits their optimal use. In this article, we investigate the impact of input variations on two configurations of a language model, focusing on parameters such as task description, surrounding context, model creativity, and the number of generated solutions. We design specific operators to modify these inputs and apply them to three LLM-based code assistants (Copilot, Codex, StarCoder2) and two benchmarks representing algorithmic problems (HumanEval, LeetCode). Our study examines whether these variations significantly affect program quality and how these effects generalize across models. Our results show that varying input parameters can greatly improve performance, achieving up to 79.27% success in one-shot generation compared to 22.44% for Codex and 31.1% for Copilot in default settings. Exploiting this potential in practice is challenging because of the complex interplay we observe: the optimal settings for temperature, prompt, and number of generated solutions vary by problem. Reproducing our study with StarCoder2 confirms these findings, indicating they are not model-specific. We also uncover surprising behaviors (e.g., fully removing the prompt can be effective), revealing model brittleness and areas for improvement.
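To make the kind of experiment described above concrete, here is a minimal sketch of a sweep over prompt operators, temperatures, and numbers of generated solutions. Everything in it is an assumption for illustration: the operator names, the grid values, the `fake_generate` stand-in (a toy function, not a real code assistant), and the two toy problems are hypothetical and are not taken from the paper.

```python
from itertools import product

# Hypothetical prompt operators; the names are illustrative, not the paper's.
def identity(prompt):
    return prompt

def drop_description(prompt):
    """Keep only the signature line, removing the task description."""
    return prompt.splitlines()[0]

OPERATORS = {"identity": identity, "drop_description": drop_description}
TEMPERATURES = [0.0, 0.4, 0.8]   # assumed grid
NUM_SOLUTIONS = [1, 5]           # assumed grid

PROBLEMS = {
    "add": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "neg": 'def neg(x):\n    """Return the negation of x."""\n',
}
TESTS = {"add": [((1, 2), 3)], "neg": [((3,), -3)]}

def fake_generate(prompt, temperature, n):
    """Toy stand-in for a code assistant, rigged so that each problem
    is only solved under a different configuration."""
    header = prompt.splitlines()[0]
    has_description = '"""' in prompt
    if header.startswith("def add"):
        ok = has_description and temperature == 0.0
        body = "    return a + b" if ok else "    return a - b"
    else:
        body = "    return x" if has_description else "    return -x"
    return [header + "\n" + body] * n

def passes(name, source):
    """Run a candidate program against the problem's test cases."""
    namespace = {}
    exec(source, namespace)  # acceptable here only because inputs are toy strings
    return all(namespace[name](*args) == want for args, want in TESTS[name])

def sweep(problems, generate):
    """Map each problem to the set of configurations that solve it."""
    results = {}
    grid = list(product(OPERATORS.items(), TEMPERATURES, NUM_SOLUTIONS))
    for name, prompt in problems.items():
        results[name] = {
            (op_name, temp, n)
            for (op_name, op), temp, n in grid
            if any(passes(name, c) for c in generate(op(prompt), temp, n))
        }
    return results

def success_rate(results, config):
    """Fraction of problems solved by one fixed configuration."""
    return sum(config in solved for solved in results.values()) / len(results)

def oracle_rate(results):
    """Fraction of problems solved by at least one configuration."""
    return sum(bool(solved) for solved in results.values()) / len(results)

results = sweep(PROBLEMS, fake_generate)
all_configs = set().union(*results.values())
best_fixed = max(success_rate(results, c) for c in all_configs)
print("best fixed configuration:", best_fixed)
print("best configuration per problem:", oracle_rate(results))
```

With this toy setup, no single configuration solves both problems, while choosing the best configuration per problem solves both. That gap is the difficulty the abstract points to: the potential gain exists, but the right settings differ from one problem to the next.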

