SE, Jan 5, 2022

ARCLIN: Automated API Mention Resolution for Unformatted Texts

Yintong Huo, Yuxin Su, Hongming Zhang, Michael R. Lyu

arXiv:2201.01459v2, 2.3, Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of extracting API knowledge from unstructured developer discussions to support tasks such as API searching. It is an incremental advance: it builds on existing methods and improves on their accuracy.

The paper tackles the problem of automatically recognizing and linking API mentions in informal technical forum texts. It proposes the ARCLIN tool, which outperforms previous heuristic-based approaches by 8% on Py-mention, a dataset of 558 mentions from five popular Python libraries.

Online technical forums (e.g., StackOverflow) are popular platforms where developers discuss technical problems such as how to use a specific Application Programming Interface (API), how to solve programming tasks, or how to fix bugs in their code. These discussions often provide auxiliary knowledge about how to use the software that is not covered by the official documentation. Automatically extracting such knowledge supports a set of downstream tasks such as API searching and indexing. However, unlike official documentation written by experts, discussions in open forums are written by regular developers in short, informal texts that contain spelling errors and abbreviations. There are three major challenges in accurately recognizing API mentions in unstructured natural language documents and linking them to entries in an API repository: (1) distinguishing API mentions from common words; (2) identifying API mentions without a fully qualified name; and (3) disambiguating API mentions that share similar method names but belong to different libraries. In this paper, to tackle these challenges, we propose ARCLIN, a tool that can effectively distinguish and link APIs without using human annotations. Specifically, we first design an API recognizer that automatically extracts API mentions from natural language sentences using a Conditional Random Field (CRF) on top of a Bi-directional Long Short-Term Memory (Bi-LSTM) module; we then apply a context-aware scoring mechanism to compute the mention-entry similarity for each entry in an API repository. Compared to previous approaches that rely on heuristic rules, our tool, which requires no manual inspection, outperforms them by 8% on Py-mention, a high-quality dataset that contains 558 mentions and 2,830 sentences from five popular Python libraries.
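To make the linking step more concrete, the following is a minimal sketch of what a context-aware mention-entry scoring function could look like. It is not ARCLIN's actual scoring mechanism, which the abstract does not specify. The repository contents, the function names (`name_similarity`, `context_similarity`, `link_mention`), the equal weighting, and the use of string similarity from Python's `difflib` are all assumptions made for illustration. The sketch also assumes the mention has already been extracted by a recognizer.

```python
import re
from difflib import SequenceMatcher

# Hypothetical toy API repository of fully qualified names (illustration only).
API_REPOSITORY = [
    "numpy.reshape",
    "numpy.ndarray.reshape",
    "torch.reshape",
    "pandas.DataFrame.merge",
    "numpy.sum",
]


def name_similarity(mention, entry):
    """Compare a possibly partial mention with the trailing part of an entry.

    A mention such as "reshape" has one component, so it is compared with the
    last component of the entry. Fuzzy matching tolerates small typos.
    """
    mention_parts = mention.lower().strip("()").split(".")
    entry_parts = entry.lower().split(".")
    suffix = ".".join(entry_parts[-len(mention_parts):])
    return SequenceMatcher(None, ".".join(mention_parts), suffix).ratio()


def context_similarity(sentence, entry):
    """Fraction of the entry's qualifiers (library, class) found in the sentence."""
    tokens = set(re.findall(r"[a-z_]+", sentence.lower()))
    qualifiers = entry.lower().split(".")[:-1]
    if not qualifiers:
        return 0.0
    return sum(q in tokens for q in qualifiers) / len(qualifiers)


def link_mention(mention, sentence, repository=API_REPOSITORY, name_weight=0.5):
    """Return the (entry, score) pair with the highest combined score.

    The weighted sum and its default weight are assumptions, not tuned values.
    """
    scored = [
        (
            entry,
            name_weight * name_similarity(mention, entry)
            + (1.0 - name_weight) * context_similarity(sentence, entry),
        )
        for entry in repository
    ]
    return max(scored, key=lambda pair: pair[1])


best_entry, best_score = link_mention(
    "reshape", "In torch, how do I reshape a tensor?"
)
print(best_entry)  # torch.reshape
```

In this toy setting the mention "reshape" matches three entries equally well by name, and only the surrounding word "torch" separates them. This mirrors challenges (2) and (3) above: the mention lacks a fully qualified name, and the same method name exists in several libraries.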
