Morescient GAI for Software Engineering (Extended Version)
This work aims to improve the reliability of generative AI in software engineering by addressing a key bottleneck in model training. It is incremental in that it builds on existing GAI approaches.
The paper addresses a limitation of existing LLM-based code models: they are trained only on the syntactic facet of software, which reduces their trustworthiness for tasks that depend on software semantics. It proposes a vision for developing 'Morescient' GAI models trained on both the semantic and static facets of software.
The ability of Generative AI (GAI) technology to automatically check, synthesize and modify software engineering artifacts promises to revolutionize all aspects of software engineering. Using GAI for software engineering tasks is consequently one of the most rapidly expanding fields of software engineering research, with over a hundred LLM-based code models having been published since 2021. However, the overwhelming majority of existing code models share a major weakness: they are exclusively trained on the syntactic facet of software, significantly lowering their trustworthiness in tasks dependent on software semantics. To address this problem, a new class of "Morescient" GAI is needed that is "aware" of (i.e., trained on) both the semantic and static facets of software. This, in turn, will require a new generation of software observation platforms capable of generating large quantities of execution observations in a structured and readily analyzable way. In this paper, we present a vision and roadmap for how such "Morescient" GAI models can be engineered, evolved and disseminated according to the principles of open science.
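To make the notion of a structured execution observation more concrete, the following is a minimal sketch in Python. It is not the observation platform envisioned in the paper. The `Observation` record, its field names, the `observe` helper and the `safe_div` example function are all hypothetical, introduced only to illustrate what recording run-time behavior (inputs, outputs, raised exceptions) alongside the code might look like.

```python
import json
from dataclasses import dataclass, asdict
from typing import Any, Callable, Iterable, List, Optional


@dataclass
class Observation:
    """One record of a single execution (hypothetical schema, for illustration)."""
    function: str             # name of the observed function
    args: tuple               # positional arguments passed to the call
    result: Any               # return value, or None if the call raised
    exception: Optional[str]  # exception type name, or None on success


def observe(fn: Callable, inputs: Iterable[tuple]) -> List[Observation]:
    """Run fn on each argument tuple and record what happened."""
    records = []
    for args in inputs:
        try:
            records.append(Observation(fn.__name__, args, fn(*args), None))
        except Exception as exc:
            # A failing call is still an observation of the code's behavior.
            records.append(
                Observation(fn.__name__, args, None, type(exc).__name__)
            )
    return records


def safe_div(a, b):
    """Toy function under observation (assumption: stands in for real code)."""
    return a / b


observations = observe(safe_div, [(6, 3), (1, 0)])

# Emit one JSON object per line so the records are easy to analyze in bulk.
for obs in observations:
    print(json.dumps(asdict(obs)))
```

Each line of output pairs a concrete input with the behavior it produced, which is the kind of semantic information that the source text of `safe_div` alone does not make explicit.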