CL AI PL SE

May 9, 2023

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui

arXiv:2305.06156v222.2138 citations

Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses developers' need for better code understanding and generation tools, though its contribution is incremental because it builds on existing dataset and model approaches.

The authors introduce The Vault, a multilingual dataset of 43 million high-quality code-text pairs, and show that code large language models fine-tuned on it outperform the same models trained on other datasets such as CodeSearchNet on code generation, code search, and code summarization.

We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present extraction methods that combine rule-based and deep learning-based techniques to ensure that samples contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. Our extensive evaluations on common coding tasks, including code generation, code search, and code summarization, show that Code Large Language Models fine-tuned on The Vault outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our datasets to assess the effects of various programming languages and docstrings on the performance of such models.
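To make the idea of rule-based filtering of code-text pairs concrete, here is a minimal sketch in Python. It is not The Vault's actual pipeline: the rules (a minimum docstring length, rejecting docstrings that contain URLs, exact-duplicate removal), the threshold `MIN_DOC_TOKENS`, and the function names `keep_pair` and `filter_pairs` are all assumptions chosen for illustration. The abstract does not specify the rules the authors use.

```python
import re
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (code, docstring)

# Hypothetical threshold, chosen only for illustration.
MIN_DOC_TOKENS = 3
URL_RE = re.compile(r"https?://\S+")


def keep_pair(code: str, docstring: str) -> bool:
    """Return True if a (code, docstring) pair passes the illustrative rules."""
    doc = docstring.strip()
    # Rule 1: both sides of the pair must be non-empty.
    if not code.strip() or not doc:
        return False
    # Rule 2: very short docstrings (e.g. "TODO") carry little signal.
    if len(doc.split()) < MIN_DOC_TOKENS:
        return False
    # Rule 3: docstrings that contain links are dropped in this sketch.
    if URL_RE.search(doc):
        return False
    return True


def filter_pairs(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Yield pairs that pass the rules, skipping exact duplicates."""
    seen = set()
    for code, doc in pairs:
        key = (code.strip(), doc.strip())
        if key in seen:
            continue
        seen.add(key)
        if keep_pair(code, doc):
            yield code, doc
```

A deep learning-based filter, as mentioned in the abstract, would presumably run as a further stage on the pairs that survive rules like these; that stage is not sketched here.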

