SE CL · Apr 23, 2025

SCALAR: A Part-of-speech Tagger for Identifiers

Christian D. Newman, Brandon Scholten, Sophia Testa, Joshua A. C. Behler, Syreen Banabilah, Michael L. Collard, Michael J. Decker, Mohamed Wiem Mkaouer, Marcos Zampieri, Eman Abdullah AlOmar, Reem Alsuhaibani, Anthony Peruma

arXiv:2504.17038v13.41 citations · h-index: 24 · Has Code · ICPC

Originality: Synthesis-oriented

AI Analysis

SCALAR addresses the need for better natural language processing in software engineering by giving developers and researchers a domain-specific tool, though the contribution appears incremental because it builds on existing methods.

The paper tackles the problem of annotating source code identifier names with part-of-speech tags by introducing SCALAR, a specialized tool that improves upon previous taggers and off-the-shelf alternatives for this task.

The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually curated oracle of identifier names and their grammar patterns. This specializes the tagger to recognize the unique structure of the natural language used by developers to create all types of identifiers (e.g., function names, variable names, etc.). SCALAR's output is compared with a previous version of the tagger, as well as a modern off-the-shelf part-of-speech tagger, to show how it improves upon other taggers' output for annotating identifiers. The code is available on GitHub.
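To make the idea of a grammar pattern concrete, the following is a minimal sketch of the kind of preprocessing such a tagger needs: splitting an identifier into words and building one feature row per word. It is not SCALAR's implementation. The function names `split_identifier` and `word_features`, the feature set, the tag labels, and the sample oracle entry are all assumptions made for illustration.

```python
import re

def split_identifier(name):
    """Split a snake_case or camelCase identifier into lowercase words."""
    words = []
    for chunk in re.split(r"[_\W]+", name):
        # Break camelCase and acronym boundaries, e.g. "getHTTPResponse".
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [w.lower() for w in words]

def word_features(name, context):
    """Build one feature row per word (hypothetical features, not SCALAR's)."""
    words = split_identifier(name)
    rows = []
    for i, word in enumerate(words):
        rows.append({
            "word": word,
            "position": i,                    # index of the word in the identifier
            "is_last": i == len(words) - 1,   # is this the final word of the name?
            "length": len(word),
            "ends_with_s": word.endswith("s"),
            "context": context,               # e.g. "function" or "variable"
        })
    return rows

# Hypothetical oracle entry: an identifier paired with its grammar pattern.
# The tag labels are placeholders, one tag per word of the identifier.
oracle_entry = {
    "name": "getUserCount",
    "context": "function",
    "pattern": ["V", "NM", "N"],
}

rows = word_features(oracle_entry["name"], oracle_entry["context"])
for row, tag in zip(rows, oracle_entry["pattern"]):
    print(row["word"], "->", tag)
```

Rows like these, once encoded numerically and paired with the per-word tags from an oracle, are the sort of input a classifier such as scikit-learn's GradientBoostingClassifier could be trained on.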
