CL · AI · SE · Dec 4, 2023

Magicoder: Empowering Code Generation with OSS-Instruct

arXiv:2312.02120v2 · 264 citations · h-index: 10 · Has Code · ICML
Originality: Highly original
AI Analysis

The paper addresses the need of developers and researchers for more realistic and controllable training data in code generation, though it is incremental in that it builds on existing data generation methods.

The authors tackle bias in synthetic instruction data for code generation by introducing OSS-Instruct, a method that uses open-source code snippets to generate diverse data. The resulting Magicoder models outperform state-of-the-art code models, and MagicoderS-CL-7B surpasses ChatGPT on HumanEval+ (66.5 vs. 65.9 pass@1).

We introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code that significantly closes the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets to generate diverse instruction data for code. Our main motivation is to mitigate the inherent bias of the synthetic data generated by LLMs through the wealth of open-source references for the production of more realistic and controllable data. The orthogonality of OSS-Instruct and other data generation methods like Evol-Instruct further enables us to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks. Notably, MagicoderS-CL-7B based on CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1). Overall, OSS-Instruct opens a new direction for crafting diverse synthetic instruction data for code using abundant open-source references.
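The core idea of OSS-Instruct, seeding an LLM with a real open-source snippet so that it writes a new problem and solution, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `sample_seed_snippet` and `build_prompt`, the 1 to 15 line window, and the prompt wording are hypothetical and are not taken from the Magicoder code base. The call to the LLM itself is left out.

```python
import random

# Hypothetical prompt wording, written for this sketch only.
# It is NOT the prompt template used by the Magicoder authors.
PROMPT_TEMPLATE = (
    "Use the code snippet below as inspiration to write a new, "
    "self-contained programming problem and a correct solution.\n\n"
    "Snippet:\n{seed}\n\n"
    "Respond with a [Problem] section followed by a [Solution] section."
)


def sample_seed_snippet(source, min_lines=1, max_lines=15, rng=None):
    """Return a random run of consecutive lines from one source document.

    The 1 to 15 line window is an assumption of this sketch.
    """
    rng = rng or random.Random()
    lines = source.splitlines()
    if not lines:
        raise ValueError("source document is empty")
    # Clamp the window to the document length so short files still work.
    low = min(min_lines, len(lines))
    high = min(max_lines, len(lines))
    length = rng.randint(low, high)
    start = rng.randint(0, len(lines) - length)
    return "\n".join(lines[start:start + length])


def build_prompt(seed):
    """Embed the seed snippet in the (hypothetical) instruction prompt."""
    return PROMPT_TEMPLATE.format(seed=seed)


if __name__ == "__main__":
    document = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"
    seed = sample_seed_snippet(document, rng=random.Random(0))
    # The resulting string would be sent to a teacher LLM of your choice.
    print(build_prompt(seed))
```

Because each prompt is anchored to a different real snippet, the generated problems inherit the variety of the underlying code corpus instead of relying only on the teacher model's own preferences, which is the bias-mitigation argument the abstract makes.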

Code Implementations: 3 repos
