IR · AI · LG · SE · May 29

MIRAGE: Metadata-Integrated Repository Analysis and Guided Enhancement for MSR Datasets

arXiv:2606.07611 · h-index: 1
Originality: Synthesis-oriented
AI Analysis

For researchers who mine software repositories, this paper offers a framework that improves dataset discoverability and reuse, though it is an incremental extension of prior work.

This paper enhances MSR dataset analysis by enriching metadata, assessing FAIRness, and applying LDA topic modeling, revealing that repository hosting sites and data formats impact citation patterns and usability.

This paper proposes an improved approach to analyzing Mining Software Repositories (MSR) datasets through metadata enrichment, FAIRness assessment, and topic-driven analysis. The work extends an earlier directory of MSR datasets by adding new annotations, enriching the metadata categories, and offering more advanced filtering options. Metadata for the MSR papers presented from 2013 to 2024 was gathered using the Semantic Scholar API. The analysis combines Latent Dirichlet Allocation (LDA) topic modeling with statistical analysis. The expanded directory adds dataset-level attributes, namely repository hosting site, format, accessibility, reusability, and dataset quality. The study finds that the choice of repository hosting site and data format influences citation patterns and dataset usability. Furthermore, the enhanced annotation approach improves the analysis and discoverability of MSR datasets, supporting more effective reuse and evaluation of research artifacts.
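
To make the LDA step mentioned in the abstract concrete, here is a minimal sketch of topic modeling with a collapsed Gibbs sampler in plain Python. The abstract does not say which LDA implementation, inference method, number of topics, or hyperparameters the authors used, so everything here is an assumption for illustration: the function names (`lda_gibbs`, `top_words`), the values of `alpha` and `beta`, and the toy token lists, which are invented and are not data from the paper.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (doc_topic, topic_word, vocab). Each row of doc_topic and
    topic_word is a probability distribution. Hyperparameters are
    illustrative defaults, not values taken from the paper.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: topic counts per document, word counts per topic,
    # and total tokens per topic.
    n_dk = [[0] * num_topics for _ in docs]
    n_kw = [[0] * V for _ in range(num_topics)]
    n_k = [0] * num_topics

    # Assign every token a random initial topic.
    z = []
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            assignments.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = word_id[w]
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][wi] -= 1
                n_k[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][wi] + beta) / (n_k[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][wi] += 1
                n_k[k] += 1

    # Smoothed estimates of the document-topic and topic-word distributions.
    doc_topic = [
        [(n_dk[d][k] + alpha) / (len(doc) + num_topics * alpha) for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)]
        for k in range(num_topics)
    ]
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Return the n most probable words for each topic."""
    return [
        [vocab[w] for w in sorted(range(len(vocab)), key=lambda w: -row[w])[:n]]
        for row in topic_word
    ]


# Toy corpus (invented tokens, standing in for preprocessed paper abstracts).
docs = [
    ["commit", "bug", "fix", "commit", "defect"],
    ["bug", "defect", "prediction", "commit"],
    ["dataset", "metadata", "fair", "reuse"],
    ["metadata", "dataset", "citation", "reuse", "fair"],
]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {k}: {words}")
```

In practice a library implementation would replace this sampler, and the abstracts retrieved through the Semantic Scholar API would need tokenization and stop-word removal before being passed in as `docs`.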

