SE CR
Aug 31, 2020

A3Ident: A Two-phased Approach to Identify the Leading Authors of Android Apps

Wei Wang, Guozhu Meng, Haoyu Wang, Kai Chen, Weimin Ge, Xiaohong Li

arXiv:2008.13768v1
5.31 citations

Originality: Incremental advance

AI Analysis

The paper addresses authorship disputes and security issues in Android development, but it is incremental: it builds on existing authorship identification methods and adapts them to Android.

The paper tackles the problem of identifying the primary authors of Android apps, which is challenging due to third-party libraries and inherited components, and achieves an average accuracy of 92.5% on standard datasets and 80.4% on obfuscated apps.

Authorship identification is the process of identifying and classifying authors from given code. It can be used in a wide range of software domains, e.g., resolving code authorship disputes, detecting plagiarism, and exposing attackers' identities. Besides the inherent challenges of legacy software development, framework programming and the crowdsourcing mode of Android development significantly raise the difficulty of authorship identification. More specifically, widespread third-party libraries and inherited components (e.g., classes, methods, and variables) dilute the primary code within the entire Android app and blur the boundaries between code written by different authors. However, prior research has not adequately addressed these challenges. To this end, we design a two-phased approach to attribute the primary code of an Android app to a specific developer. In the first phase, we put forward three types of strategies to identify the relationships between Java packages in an app: context, semantic, and structural relationships. A package aggregation algorithm is developed to cluster all packages that are highly likely to have been written by the same authors. In the second phase, we develop three types of features to capture authors' coding habits and code stylometry. Based on these features, we generate fingerprints for an author from the Android apps that author has developed and employ several machine learning algorithms for authorship classification. We evaluate our approach on three datasets that contain 15,666 apps from 257 distinct developers and achieve a 92.5% accuracy rate on average. Additionally, we test it on 2,900 obfuscated apps, which our approach classifies with an accuracy rate of 80.4%.
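
The first phase can be illustrated with a minimal sketch. The abstract does not specify how the relationships are scored or how the aggregation algorithm works, so everything below is an assumption made for illustration: the function names `context_score` and `aggregate_packages`, the equal weighting, the `0.7` threshold, the use of shared package-name prefixes as a "context" signal, and cross-package calls as a "structural" signal are all hypothetical. The semantic relationship is omitted. The sketch merges packages whose combined score passes the threshold, using a union-find structure.

```python
from itertools import combinations


def context_score(a: str, b: str) -> float:
    """Hypothetical context signal: share of leading name segments two packages have in common."""
    pa, pb = a.split("."), b.split(".")
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return shared / max(len(pa), len(pb))


def aggregate_packages(packages, calls, threshold=0.7):
    """Cluster packages that are likely to share an author (illustrative only).

    packages:  list of Java package names found in one app.
    calls:     set of (caller_package, callee_package) pairs, standing in
               for a structural relationship (assumed input).
    threshold: assumed cut-off on the combined score.
    """
    parent = {p: p for p in packages}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(packages, 2):
        structural = 1.0 if (a, b) in calls or (b, a) in calls else 0.0
        # Equal weights are an assumption, not taken from the paper.
        score = 0.5 * context_score(a, b) + 0.5 * structural
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for p in packages:
        clusters.setdefault(find(p), set()).add(p)
    # Largest cluster first; ties broken by member names for determinism.
    return sorted(clusters.values(), key=lambda c: (-len(c), sorted(c)))


if __name__ == "__main__":
    pkgs = ["com.acme.app.ui", "com.acme.app.net", "com.google.ads", "org.apache.http"]
    calls = {
        ("com.acme.app.ui", "com.acme.app.net"),
        ("com.acme.app.ui", "com.google.ads"),
    }
    for cluster in aggregate_packages(pkgs, calls):
        print(sorted(cluster))
```

In this toy input the two `com.acme.app` packages end up in one cluster, while the library packages stay separate even though the app code calls one of them. Under these assumed weights a call relationship alone is not enough to merge two packages, which mirrors the stated goal of separating primary code from third-party libraries.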
