SE · AI · Aug 17, 2022

CommitBART: A Large Pre-trained Model for GitHub Commits

arXiv:2208.08100v2 · 27 citations · h-index: 38 · Has Code
AI Analysis

This work addresses open-source developers' need for better tools to understand software evolution, though it is incremental in that it builds on existing pre-training methods.

The authors tackle the problem of understanding and generating GitHub commit messages by introducing CommitBART, a large pre-trained Transformer model trained on over 7.99 million commits across 7 programming languages. In their experiments, it significantly outperforms previous pre-trained models for code.

GitHub commits, which record the code changes with natural language messages for description, play a critical role for software developers to comprehend the software evolution. To promote the development of the open-source software community, we collect a commit benchmark including over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a large pre-trained encoder-decoder Transformer model for GitHub commits. The model is pre-trained by three categories (i.e., denoising objectives, cross-modal generation and contrastive learning) for six pre-training tasks to learn commit fragment representations. Furthermore, we unify a "commit intelligence" framework with one understanding task and three generation tasks for commits. The comprehensive experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained works for code. Further analysis also reveals each pre-training task enhances the model performance.

