CommitSuite: A Comprehensive Benchmark for Commit Classification and Message Generation
For software engineering researchers, CommitSuite provides a large-scale, semantically annotated benchmark for structured commit understanding and evaluation, a resource that has so far been lacking.
CommitSuite is a benchmark of 63,533 CCS-compliant commits from 243 repositories across seven programming languages, with AST-level code changes and LLM-assisted semantic annotations. It also proposes a reference-free evaluation framework that reaches a Cohen's kappa of 0.849 against human judgments, enabling reproducible research on commit classification and message generation.
High-quality commit messages are critical for maintaining software projects, yet ensuring their consistency and informativeness remains a practical challenge. While the Conventional Commits Specification (CCS) provides a structured format for commit messages, research on CCS-based commit classification and commit message generation (CMG) is limited by the absence of large-scale benchmarks, semantic annotations, and reliable evaluation methods. In this paper, we introduce CommitSuite, a benchmark comprising 63,533 CCS-compliant commits from 243 open-source repositories across seven programming languages. Each commit is labeled with its CCS type and enriched with AST-level code changes, along with LLM-assisted semantic annotations that capture the "what" and "why" behind the change. To evaluate CMG systems, we propose a reference-free framework based on five binary metrics (rationality, comprehensiveness, non-redundancy, authenticity, and logicality), which enables semantic-level assessment without relying on human-written references. Our experiments show that LLMs can effectively support both generation and evaluation, with LLM-based evaluation reaching a Cohen's kappa of 0.849 against human judgments. CommitSuite offers a unified resource for structured commit understanding and facilitates reproducible research on commit classification and generation.
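As a minimal sketch of two ingredients mentioned above, the Python snippet below parses a CCS-style commit header into its type, scope, and description, and computes Cohen's kappa between two lists of binary judgments. The function names `parse_ccs_header` and `cohens_kappa`, the simplified regular expression, and the sample labels are assumptions made for illustration; they are not the compliance filter, data, or evaluation code used to build CommitSuite.

```python
import re
from collections import Counter

# Simplified pattern for a Conventional Commits header line:
#   <type>[(scope)][!]: <description>
# Assumption: an illustrative approximation of the format, not the
# compliance check used to construct the benchmark.
CCS_HEADER = re.compile(
    r"^(?P<type>[A-Za-z]+)"
    r"(?:\((?P<scope>[^()\r\n]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_ccs_header(header):
    """Return the parts of a CCS-style header line, or None if it does not match."""
    match = CCS_HEADER.match(header.strip())
    if match is None:
        return None
    return {
        "type": match.group("type").lower(),
        "scope": match.group("scope"),  # None when no scope is given
        "breaking": match.group("breaking") is not None,
        "description": match.group("description"),
    }


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, and this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    print(parse_ccs_header("feat(parser)!: add array support"))
    # Made-up binary judgments (1 = pass, 0 = fail) for one metric.
    human = [1, 1, 0, 0, 1, 0, 1, 1]
    model = [1, 1, 0, 1, 1, 0, 0, 1]
    print(round(cohens_kappa(human, model), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters for binary metrics where one label may dominate.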