Advancing Language Models for Code-related Tasks
This work helps developers with software engineering tasks by incrementally improving how existing language models handle complex programming scenarios.
The research tackled the limitations of language models in complex programming scenarios along three directions: improving data quality (CODA and CodeDenoise), enhancing model architecture (LEAM and LEAM++), and advancing reasoning capabilities (muFiX and Specine). Together, these techniques aim to promote the practical adoption of language models in software development.
Recent advances in language models (LMs) have driven significant progress in various software engineering tasks. However, existing LMs still struggle with complex programming scenarios due to limitations in data quality, model architecture, and reasoning capability. This research systematically addresses these challenges through three complementary directions: (1) improving code data quality with a code difference-guided adversarial augmentation technique (CODA) and a code denoising technique (CodeDenoise); (2) enhancing model architecture via syntax-guided code LMs (LEAM and LEAM++); and (3) advancing model reasoning with a prompting technique (muFiX) and an agent-based technique (Specine). These techniques aim to promote the practical adoption of LMs in software development and further advance intelligent software engineering.