RETROcode: Leveraging a Code Database for Improved Natural Language to Code Generation
This work addresses efficiency and scalability challenges in code generation for developers and researchers, representing an incremental improvement by adapting an existing architecture to a new domain.
The authors address the high computational demands and overfitting risks of natural language to code generation by adapting the RETRO architecture to use a large code database as an auxiliary scaling method. The resulting model, RETROcode, outperforms similar-sized models and approaches the effectiveness of the much larger Codex model while being trained on a substantially smaller dataset.
As text and code resources have expanded, large-scale pre-trained models have shown promising capabilities in code generation tasks, typically employing supervised fine-tuning on pairs of problem statements and programs. However, increasing model size and data volume to gain performance also raises computational demands and the risk of overfitting. To address these challenges, we present RETROcode, a novel adaptation of the RETRO architecture \cite{RETRO} for sequence-to-sequence models that uses a large code database as an auxiliary scaling method. Rather than simply enlarging the model and dataset, RETROcode leverages a vast code database at prediction time, improving the model's efficiency by giving it access to an extensive memory. Our findings indicate that RETROcode not only outperforms similar-sized traditional architectures on test sets but also approaches the effectiveness of the much larger Codex model, despite being trained from scratch on a substantially smaller dataset.
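To make the idea of predicting with the help of a code database more concrete, the following is a minimal sketch of the retrieval step only. It is not the RETROcode implementation: the bag-of-tokens similarity is a toy stand-in for a learned encoder, and the names `embed`, `cosine`, `retrieve`, and `CODE_DB`, as well as the database layout (a description paired with a snippet), are hypothetical choices made for illustration. How the retrieved neighbors are fed into the sequence-to-sequence model is not shown.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-tokens embedding (assumption: stands in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, database, k=2):
    """Return the k database entries whose descriptions are most similar to the query."""
    q = embed(query)
    ranked = sorted(
        database,
        key=lambda entry: cosine(q, embed(entry["description"])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical miniature code database: description -> snippet.
CODE_DB = [
    {"description": "reverse a list in place", "code": "xs.reverse()"},
    {"description": "sort a list of numbers", "code": "xs.sort()"},
    {"description": "read a file into a string", "code": "open(path).read()"},
]

# The retrieved snippets would be passed to the generator as extra context.
neighbors = retrieve("how to reverse a list", CODE_DB, k=2)
for entry in neighbors:
    print(entry["code"])
```

In this sketch the database can grow without changing the number of model parameters, which is the sense in which retrieval serves as an auxiliary scaling method.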