An Investigation into the Use of Common Libraries in Android Apps
This work addresses a specific issue for Android security and analysis researchers: it provides a comprehensive library dataset to improve accuracy in tasks like malware detection, though it is incremental in that it builds on prior efforts.
The study tackled the problem of inaccurate Android app analyses caused by library code noise by harvesting and analyzing common libraries from 1.5 million apps, yielding the largest such dataset: 1,113 common libraries and 240 advertisement libraries.
The packaging model of Android apps requires all code necessary for the execution of an app to be shipped in a single apk file. Thus, an analysis of an Android app often visits code that is not part of the functionality delivered by the app itself. Such code is often contributed by common libraries, which are used pervasively across apps. Unfortunately, Android analyses, e.g., for piggybacking detection and malware detection, can produce inaccurate results if they do not account for library code, which constitutes noise in app features. Despite some efforts to investigate Android libraries, Android research has not yet produced a complete set of common libraries to support in-depth analysis of Android apps. In this paper, we leverage a dataset of about 1.5 million apps from Google Play to harvest potential common libraries, including advertisement libraries. After several refinement steps, we collect by far the largest set: 1,113 libraries supporting common functionalities and 240 libraries for advertisement. We use this dataset to investigate several aspects of Android libraries, including their popularity and their proportion in Android app code. Based on these datasets, we have further performed several empirical investigations to confirm the motivations behind our work.
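To make the idea of harvesting common libraries and filtering them out of app features concrete, the following is a minimal sketch, not the pipeline used in the paper. It assumes that each app has already been reduced to a set of Java package names, and that a package seen in at least `min_apps` distinct apps is a candidate common library. The function names, the threshold, and the toy corpus are all hypothetical.

```python
from collections import Counter


def candidate_libraries(app_packages, min_apps):
    """Return package names that occur in at least `min_apps` distinct apps."""
    counts = Counter()
    for packages in app_packages.values():
        # Count each package once per app, however many classes it contains.
        counts.update(set(packages))
    return {pkg for pkg, n in counts.items() if n >= min_apps}


def strip_library_packages(packages, libraries):
    """Drop packages that equal, or are nested under, a candidate library."""
    def is_library(pkg):
        return any(pkg == lib or pkg.startswith(lib + ".") for lib in libraries)
    return {pkg for pkg in packages if not is_library(pkg)}


# Hypothetical toy corpus: app id -> package names found in its apk.
apps = {
    "app_a": {"com.example.a", "com.google.ads", "org.apache.http"},
    "app_b": {"com.example.b", "com.google.ads"},
    "app_c": {"com.example.c", "com.google.ads", "org.apache.http"},
}

libs = candidate_libraries(apps, min_apps=2)
own_code = strip_library_packages(apps["app_a"], libs)
```

On this toy corpus, `libs` contains the two packages shared by several apps, and `own_code` keeps only the package specific to `app_a`. A real harvest would need the further refinement steps the abstract mentions, which this sketch does not attempt to reproduce.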