29.4SEMay 6
Agentic Repository Mining: A Multi-Task EvaluationJohannes Härtel
Mining software repositories often requires classifying artifacts like commits, reviews, code lines, or entire repositories into categories. Human labeling is expensive and error-prone; limited context frequently leads to misclassifications or uncertainty in labels. We investigate whether LLM agents that dynamically explore repositories through standard bash commands can match the classification quality of simple LLMs that receive pre-engineered context. Across four tasks, eight approach configurations, and 4943 classifications, agents achieve competitive accuracy despite retrieving their own context. The primary advantage is robustness: agents avoid context-window overflows and scale independently of artifact size. A manual diagnosis of 100 cases where approaches disagree with the ground truth reveals specification ambiguities and labels produced under limited context, suggesting that accuracy against such ground truth may underestimate approaches with broader context access.
PLJan 27, 2017
Interconnected Linguistic ArchitectureJohannes Härtel, Lukas Härtel, Ralf Lämmel et al.
The context of the reported research is the documentation of software technologies such as object/relational mappers, web-application frameworks, or code generators. We assume that documentation should model a macroscopic view on usage scenarios of technologies in terms of involved artifacts, leveraged software languages, data flows, conformance relationships, and others. In previous work, we referred to such documentation also as 'linguistic architecture'. The corresponding models may also be referred to as 'megamodels' while adopting this term from the technological space of modeling/model-driven engineering. This work is an inquiry into making such documentation less abstract and more effective by means of connecting (mega)models, systems, and developer experience in several ways. To this end, we adopt an approach that is primarily based on prototyping (i.e., implementa- tion of a megamodeling infrastructure with all conceivable connections) and experimentation with showcases (i.e., documentation of concrete software technologies). The knowledge gained by this research is a notion of interconnected linguistic architecture on the grounds of connecting primary model elements, inferred model elements, static and runtime system artifacts, traceability links, system contexts, knowledge resources, plugged interpretations of model elements, and IDE views. A corresponding suite of aspects of interconnected linguistic architecture is systematically described. As to the grounding of this research, we describe a literature survey which tracks scattered occurrences and thus demonstrates the relevance of the identified aspects of interconnected linguistic architecture. Further, we describe the MegaL/Xtext+IDE infrastructure which realizes interconnected linguistic architecture. The importance of this work lies in providing more formal (ontologically rich, navigable, verifiable) documentation of software technologies helping developers to better understand how to use technologies in new systems (prescriptive mode) or how technologies are used in existing systems (descriptive mode).