SECL | Apr 23, 2025

SCALAR: A Part-of-speech Tagger for Identifiers

arXiv:2504.17038v1 | 1 citation | h-index: 24 | Has Code | ICPC
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better natural language processing in software engineering by providing a domain-specific tool for developers and researchers, though it appears incremental because it builds on existing methods.

The paper tackles the problem of annotating source code identifier names with part-of-speech tags by introducing SCALAR, a specialized tool that improves upon previous taggers and off-the-shelf alternatives for this task.

The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually curated oracle of identifier names and their grammar patterns. This specializes the tagger to recognize the unique structure of the natural language that developers use to create all types of identifiers (e.g., function names, variable names, etc.). SCALAR's output is compared with that of a previous version of the tagger, as well as a modern off-the-shelf part-of-speech tagger, to show how it improves upon other taggers' output for annotating identifiers. The code is available on GitHub.
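To make the idea of training a GradientBoostingClassifier on an oracle of identifiers and grammar patterns more concrete, here is a minimal sketch. It is not SCALAR's actual pipeline: the splitting rule, the per-word features, the three-tag label set (V, NM, N), and the tiny `TOY_ORACLE` are all assumptions invented for illustration, and the abstract does not describe the real feature set.

```python
import re
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical toy oracle: identifier -> grammar pattern (one tag per word).
# The tags here (V = verb, NM = noun modifier, N = noun) are illustrative only.
TOY_ORACLE = [
    ("getUserName", ["V", "NM", "N"]),
    ("setMaxSize", ["V", "NM", "N"]),
    ("parseInputFile", ["V", "NM", "N"]),
    ("buffer_size", ["NM", "N"]),
    ("user_count", ["NM", "N"]),
    ("index", ["N"]),
]


def split_identifier(name):
    """Split an identifier on underscores and lower-to-upper camelCase boundaries."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name).replace("_", " ")
    return [word.lower() for word in spaced.split()]


def word_features(words):
    """Build one numeric feature row per word (assumed features, not SCALAR's)."""
    n = len(words)
    return [
        [i, n, int(i == 0), int(i == n - 1), len(word)]
        for i, word in enumerate(words)
    ]


def build_training_set(oracle):
    """Flatten the oracle into per-word feature rows and per-word tag labels."""
    X, y = [], []
    for name, pattern in oracle:
        words = split_identifier(name)
        # Each identifier must have exactly one tag per word.
        assert len(words) == len(pattern), name
        X.extend(word_features(words))
        y.extend(pattern)
    return X, y


X_train, y_train = build_training_set(TOY_ORACLE)
model = GradientBoostingClassifier(n_estimators=20, random_state=0)
model.fit(X_train, y_train)


def tag_identifier(model, name):
    """Predict a tag for each word and join the tags into a grammar pattern."""
    words = split_identifier(name)
    tags = model.predict(word_features(words))
    return " ".join(tags)


print(tag_identifier(model, "readConfigFile"))
```

Each word becomes one training row, so the classifier predicts a tag per word and the grammar pattern is the sequence of those predictions. A realistic tagger would need far richer features and a much larger oracle than this toy setup.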
