Sen Chen

SE
h-index44
30papers
1,941citations
Novelty45%
AI Score51

30 Papers

SEAug 9, 2023Code
A Comprehensive Empirical Study of Bugs in Open-Source Federated Learning Frameworks

Weijie Shao, Yuyang Gao, Fu Song et al.

Federated learning (FL) is a distributed machine learning (ML) paradigm, allowing multiple clients to collaboratively train shared machine learning (ML) models without exposing clients' data privacy. It has gained substantial popularity in recent years, especially since the enforcement of data protection laws and regulations in many countries. To foster the application of FL, a variety of FL frameworks have been proposed, allowing non-experts to easily train ML models. As a result, understanding bugs in FL frameworks is critical for facilitating the development of better FL frameworks and potentially encouraging the development of bug detection, localization and repair tools. Thus, we conduct the first empirical study to comprehensively collect, taxonomize, and characterize bugs in FL frameworks. Specifically, we manually collect and classify 1,119 bugs from all the 676 closed issues and 514 merged pull requests in 17 popular and representative open-source FL frameworks on GitHub. We propose a classification of those bugs into 12 bug symptoms, 12 root causes, and 18 fix patterns. We also study their correlations and distributions on 23 functionalities. We identify nine major findings from our study, discuss their implications and future research directions based on our findings.

SDJun 7, 2022
Towards Understanding and Mitigating Audio Adversarial Examples for Speaker Recognition

Guangke Chen, Zhe Zhao, Fu Song et al.

Speaker recognition systems (SRSs) have recently been shown to be vulnerable to adversarial attacks, raising significant security concerns. In this work, we systematically investigate transformation and adversarial training based defenses for securing SRSs. According to the characteristic of SRSs, we present 22 diverse transformations and thoroughly evaluate them using 7 recent promising adversarial attacks (4 white-box and 3 black-box) on speaker recognition. With careful regard for best practices in defense evaluations, we analyze the strength of transformations to withstand adaptive attacks. We also evaluate and understand their effectiveness against adaptive attacks when combined with adversarial training. Our study provides lots of useful insights and findings, many of them are new or inconsistent with the conclusions in the image and speech recognition domains, e.g., variable and constant bit rate speech compressions have different performance, and some non-differentiable transformations remain effective against current promising evasion techniques which often work well in the image domain. We demonstrate that the proposed novel feature-level transformation combined with adversarial training is rather effective compared to the sole adversarial training in a complete white-box setting, e.g., increasing the accuracy by 13.62% and attack cost by two orders of magnitude, while other transformations do not necessarily improve the overall defense capability. This work sheds further light on the research directions in this field. We also release our evaluation platform SPEAKERGUARD to foster further research.

SEJul 24, 2024Code
PatchFinder: A Two-Phase Approach to Security Patch Tracing for Disclosed Vulnerabilities in Open-Source Software

Kaixuan Li, Jian Zhang, Sen Chen et al.

Open-source software (OSS) vulnerabilities are increasingly prevalent, emphasizing the importance of security patches. However, in widely used security platforms like NVD, a substantial number of CVE records still lack trace links to patches. Although rank-based approaches have been proposed for security patch tracing, they heavily rely on handcrafted features in a single-step framework, which limits their effectiveness. In this paper, we propose PatchFinder, a two-phase framework with end-to-end correlation learning for better-tracing security patches. In the **initial retrieval** phase, we employ a hybrid patch retriever to account for both lexical and semantic matching based on the code changes and the description of a CVE, to narrow down the search space by extracting those commits as candidates that are similar to the CVE descriptions. Afterwards, in the **re-ranking** phase, we design an end-to-end architecture under the supervised fine-tuning paradigm for learning the semantic correlations between CVE descriptions and commits. In this way, we can automatically rank the candidates based on their correlation scores while maintaining low computation overhead. We evaluated our system against 4,789 CVEs from 532 OSS projects. The results are highly promising: PatchFinder achieves a Recall@10 of 80.63% and a Mean Reciprocal Rank (MRR) of 0.7951. Moreover, the Manual Effort@10 required is curtailed to 2.77, marking a 1.94 times improvement over current leading methods. When applying PatchFinder in practice, we initially identified 533 patch commits and submitted them to the official, 482 of which have been confirmed by CVE Numbering Authorities.

SDJun 7, 2022
AS2T: Arbitrary Source-To-Target Adversarial Attack on Speaker Recognition Systems

Guangke Chen, Zhe Zhao, Fu Song et al.

Recent work has illuminated the vulnerability of speaker recognition systems (SRSs) against adversarial attacks, raising significant security concerns in deploying SRSs. However, they considered only a few settings (e.g., some combinations of source and target speakers), leaving many interesting and important settings in real-world attack scenarios alone. In this work, we present AS2T, the first attack in this domain which covers all the settings, thus allows the adversary to craft adversarial voices using arbitrary source and target speakers for any of three main recognition tasks. Since none of the existing loss functions can be applied to all the settings, we explore many candidate loss functions for each setting including the existing and newly designed ones. We thoroughly evaluate their efficacy and find that some existing loss functions are suboptimal. Then, to improve the robustness of AS2T towards practical over-the-air attack, we study the possible distortions occurred in over-the-air transmission, utilize different transformation functions with different parameters to model those distortions, and incorporate them into the generation of adversarial voices. Our simulated over-the-air evaluation validates the effectiveness of our solution in producing robust adversarial voices which remain effective under various hardware devices and various acoustic environments with different reverberation, ambient noises, and noise levels. Finally, we leverage AS2T to perform thus far the largest-scale evaluation to understand transferability among 14 diverse SRSs. The transferability analysis provides many interesting and useful insights which challenge several findings and conclusion drawn in previous works in the image domain. Our study also sheds light on future directions of adversarial attacks in the speaker recognition domain.

SEApr 21Code
DeepFWI: Identifying Bug-Sensitive Warnings with Multi-Modal Code-Warning Semantics

Han Liu, Jian Zhang, Cen Zhang et al.

Static analysis tools have evolved over time to assist in detecting bugs. However, the excessive false warnings can impede developers' productivity and confidence in the tools. Previous research efforts have explored learning-based approaches to identify bug warnings. Nevertheless, their coarse granularity, focusing on either long-term warnings or function-level alerts, is insensitive to individual bugs. Also, they rely on manually crafted features or solely on source code semantics, which is inadequate for effective learning. In this paper, we propose DeepFWI, a learning-based approach that identifies bug-sensitive warnings at a fine-grained granularity. Specifically, we design a novel LSTM-based model that captures multi-modal semantics of source code and warnings from automated static analysis tools (ASATs) and highlights their correlations with cross-attention. To tackle the data scarcity of training and evaluation, we collected a large-scale dataset of 280,273 warnings. We conducted extensive experiments on the dataset to evaluate DeepFWI. The experimental results demonstrate the effectiveness of our approach, with an F1-score 67.06% for confirming true warnings in a finer-grained manner, significantly outperforming all baselines. Additionally, to validate the practicality of DeepFWI from the perspective of developers, we applied DeepFWI to four popular open-source projects. Our approach filtered out the vast majority of warnings, while still successfully surfacing 25 true bug-related warnings that were confirmed through manual analysis.

CVApr 27, 2022
Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion

Sen Chen, Zhilei Liu, Jiaxing Liu et al.

Talking head generation is to synthesize a lip-synchronized talking head video by inputting an arbitrary face image and corresponding audio clips. Existing methods ignore not only the interaction and relationship of cross-modal information, but also the local driving information of the mouth muscles. In this study, we propose a novel generative framework that contains a dilated non-causal temporal convolutional self-attention network as a multimodal fusion module to promote the relationship learning of cross-modal features. In addition, our proposed method uses both audio- and speech-related facial action units (AUs) as driving information. Speech-related AU information can guide mouth movements more accurately. Because speech is highly correlated with speech-related AUs, we propose an audio-to-AU module to predict speech-related AU information. We utilize pre-trained AU classifier to ensure that the generated images contain correct AU information. We verify the effectiveness of the proposed model on the GRID and TCD-TIMIT datasets. An ablation study is also conducted to verify the contribution of each component. The results of quantitative and qualitative experiments demonstrate that our method outperforms existing methods in terms of both image quality and lip-sync accuracy.

SEDec 20, 2019Code
CORE: Automating Review Recommendation for Code Changes

JingKai Siow, Cuiyun Gao, Lingling Fan et al.

Code review is a common process that is used by developers, in which a reviewer provides useful comments or points out defects in the submitted source code changes via pull request. Code review has been widely used for both industry and open-source projects due to its capacity in early defect identification, project maintenance, and code improvement. With rapid updates on project developments, code review becomes a non-trivial and labor-intensive task for reviewers. Thus, an automated code review engine can be beneficial and useful for project development in practice. Although there exist prior studies on automating the code review process by adopting static analysis tools or deep learning techniques, they often require external sources such as partial or full source code for accurate review suggestion. In this paper, we aim at automating the code review process only based on code changes and the corresponding reviews but with better performance. The hinge of accurate code review suggestion is to learn good representations for both code changes and reviews. To achieve this with limited source, we design a multi-level embedding (i.e., word embedding and character embedding) approach to represent the semantics provided by code changes and reviews. The embeddings are then well trained through a proposed attentional deep learning model, as a whole named CORE. We evaluate the effectiveness of CORE on code changes and reviews collected from 19 popular Java projects hosted on Github. Experimental results show that our model CORE can achieve significantly better performance than the state-of-the-art model (DeepMem), with an increase of 131.03% in terms of Recall@10 and 150.69% in terms of Mean Reciprocal Rank. Qualitative general word analysis among project developers also demonstrates the performance of CORE in automating code review.

ASNov 3, 2019Code
Who is Real Bob? Adversarial Attacks on Speaker Recognition Systems

Guangke Chen, Sen Chen, Lingling Fan et al.

Speaker recognition (SR) is widely used in our daily life as a biometric authentication or identification mechanism. The popularity of SR brings in serious security concerns, as demonstrated by recent adversarial attacks. However, the impacts of such threats in the practical black-box setting are still open, since current attacks consider the white-box setting only. In this paper, we conduct the first comprehensive and systematic study of the adversarial attacks on SR systems (SRSs) to understand their security weakness in the practical blackbox setting. For this purpose, we propose an adversarial attack, named FAKEBOB, to craft adversarial samples. Specifically, we formulate the adversarial sample generation as an optimization problem, incorporated with the confidence of adversarial samples and maximal distortion to balance between the strength and imperceptibility of adversarial voices. One key contribution is to propose a novel algorithm to estimate the score threshold, a feature in SRSs, and use it in the optimization problem to solve the optimization problem. We demonstrate that FAKEBOB achieves 99% targeted attack success rate on both open-source and commercial systems. We further demonstrate that FAKEBOB is also effective on both open-source and commercial systems when playing over the air in the physical world. Moreover, we have conducted a human study which reveals that it is hard for human to differentiate the speakers of the original and adversarial voices. Last but not least, we show that four promising defense methods for adversarial attack from the speech recognition domain become ineffective on SRSs against FAKEBOB, which calls for more effective defense methods. We highlight that our study peeks into the security implications of adversarial attacks on SRSs, and realistically fosters to improve the security robustness of SRSs.

SEJan 22, 2018Code
Large-Scale Analysis of Framework-Specific Exceptions in Android Apps

Lingling Fan, Ting Su, Sen Chen et al.

Mobile apps have become ubiquitous. For app developers, it is a key priority to ensure their apps' correctness and reliability. However, many apps still suffer from occasional to frequent crashes, weakening their competitive edge. Large-scale, deep analyses of the characteristics of real-world app crashes can provide useful insights to guide developers, or help improve testing and analysis tools. However, such studies do not exist -- this paper fills this gap. Over a four-month long effort, we have collected 16,245 unique exception traces from 2,486 open-source Android apps, and observed that framework-specific exceptions account for the majority of these crashes. We then extensively investigated the 8,243 framework-specific exceptions (which took six person-months): (1) identifying their characteristics (e.g., manifestation locations, common fault categories), (2) evaluating their manifestation via state-of-the-art bug detection techniques, and (3) reviewing their fixes. Besides the insights they provide, these findings motivate and enable follow-up research on mobile apps, such as bug detection, fault localization and patch generation. In addition, to demonstrate the utility of our findings, we have optimized Stoat, a dynamic testing tool, and implemented ExLocator, an exception localization tool, for Android apps. Stoat is able to quickly uncover three previously-unknown, confirmed/fixed crashes in Gmail and Google+; ExLocator is capable of precisely locating the root causes of identified exceptions in real-world apps. Our substantial dataset is made publicly available to share with and benefit the community.

SEMar 24
A Practical Framework for Flaky Failure Triage in Distributed Database Continuous Integration

Jun-Peng Zhu, Qizhi Wang, Yulong Zhai et al.

Flaky failure triage is crucial for keeping distributed database continuous integration (CI) efficient and reliable. After a failure is observed, operators must quickly decide whether to auto-rerun the job as likely flaky or escalate it as likely persistent, often under CPU-only millisecond budgets. Existing approaches remain difficult to deploy in this setting because they may rely on post-failure artifacts, produce poorly calibrated scores under telemetry and workload shifts, or learn from labels generated by finite rerun policies. To address these challenges, we present SCOUT, a practical state-aware causal online uncertainty-calibrated triage framework for distributed database CI. SCOUT uses only strict-causal features, including pre-failure telemetry and strictly historical data, to make online decisions without lookahead. Specifically, SCOUT combines lightweight state-aware scoring with optional sparse metadata fusion, applies post-hoc calibration to support fixed-threshold decisions across temporal and cross-domain shifts, and introduces a posterior-soft correction to reduce label bias induced by finite rerun budgets. We evaluated SCOUT on a benchmark of 3,680 labeled failed runs, including 462 flaky positives, and 62 telemetry/context features. Further, we studied the feasibility of SCOUT on TiDB v7/v8 and a large GitHub Actions metadata-only trace. The experimental results demonstrated its effectiveness and usefulness. We deployed SCOUT in the production environment, achieving an end-to-end P95 latency of 1.17 ms on CPU.

CRNov 3, 2024
Large Language Model Supply Chain: Open Problems From the Security Perspective

Qiang Hu, Xiaofei Xie, Sen Chen et al.

Large Language Model (LLM) is changing the software development paradigm and has gained huge attention from both academia and industry. Researchers and developers collaboratively explore how to leverage the powerful problem-solving ability of LLMs for specific domain tasks. Due to the wide usage of LLM-based applications, e.g., ChatGPT, multiple works have been proposed to ensure the security of LLM systems. However, a comprehensive understanding of the entire processes of LLM system construction (the LLM supply chain) is crucial but relevant works are limited. More importantly, the security issues hidden in the LLM SC which could highly impact the reliable usage of LLMs are lack of exploration. Existing works mainly focus on assuring the quality of LLM from the model level, security assurance for the entire LLM SC is ignored. In this work, we take the first step to discuss the potential security risks in each component as well as the integration between components of LLM SC. We summarize 12 security-related risks and provide promising guidance to help build safer LLM systems. We hope our work can facilitate the evolution of artificial general intelligence with secure LLM ecosystems.

AINov 20, 2025
D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies

Sen Chen, Tong Zhao, Yi Bin et al.

Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. While most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.

SEJan 11, 2022
Demystifying the Vulnerability Propagation and Its Evolution via Dependency Trees in the NPM Ecosystem

Chengwei Liu, Sen Chen, Lingling Fan et al.

Third-party libraries with rich functionalities facilitate the fast development of Node.js software, but also bring new security threats that vulnerabilities could be introduced through dependencies. In particular, the threats could be excessively amplified by transitive dependencies. Existing research either considers direct dependencies or reasoning transitive dependencies based on reachability analysis, which neglects the NPM-specific dependency resolution rules, resulting in wrongly resolved dependencies. Consequently, further fine-grained analysis, such as vulnerability propagation and their evolution in dependencies, cannot be carried out precisely at a large scale, as well as deriving ecosystem-wide solutions for vulnerabilities in dependencies. To fill this gap, we propose a knowledge graph-based dependency resolution, which resolves the dependency relations of dependencies as trees (i.e., dependency trees), and investigates the security threats from vulnerabilities in dependency trees at a large scale. We first construct a complete dependency-vulnerability knowledge graph (DVGraph) that captures the whole NPM ecosystem (over 10 million library versions and 60 million well-resolved dependency relations). Based on it, we propose DTResolver to statically and precisely resolve dependency trees, as well as transitive vulnerability propagation paths, by considering the official dependency resolution rules. Based on that, we carry out an ecosystem-wide empirical study on vulnerability propagation and its evolution in dependency trees. Our study unveils lots of useful findings, and we further discuss the lessons learned and solutions for different stakeholders to mitigate the vulnerability impact in NPM. For example, we implement a dependency tree based vulnerability remediation method (DTReme) for NPM packages, and receive much better performance than the official tool (npm audit fix).

CVOct 19, 2021
Talking Head Generation with Audio and Speech Related Facial Action Units

Sen Chen, Zhilei Liu, Jiaxing Liu et al.

The task of talking head generation is to synthesize a lip synchronized talking head video by inputting an arbitrary face image and audio clips. Most existing methods ignore the local driving information of the mouth muscles. In this paper, we propose a novel recurrent generative network that uses both audio and speech-related facial action units (AUs) as the driving information. AU information related to the mouth can guide the movement of the mouth more accurately. Since speech is highly correlated with speech-related AUs, we propose an Audio-to-AU module in our system to predict the speech-related AU information from speech. In addition, we use AU classifier to ensure that the generated images contain correct AU information. Frame discriminator is also constructed for adversarial training to improve the realism of the generated face. We verify the effectiveness of our model on the GRID dataset and TCD-TIMIT dataset. We also conduct an ablation study to verify the contribution of each component in our model. Quantitative and qualitative experiments demonstrate that our method outperforms existing methods in both image quality and lip-sync accuracy.

CRSep 4, 2021
SEC4SR: A Security Analysis Platform for Speaker Recognition

Guangke Chen, Zhe Zhao, Fu Song et al.

Adversarial attacks have been expanded to speaker recognition (SR). However, existing attacks are often assessed using different SR models, recognition tasks and datasets, and only few adversarial defenses borrowed from computer vision are considered. Yet,these defenses have not been thoroughly evaluated against adaptive attacks. Thus, there is still a lack of quantitative understanding about the strengths and limitations of adversarial attacks and defenses. More effective defenses are also required for securing SR systems. To bridge this gap, we present SEC4SR, the first platform enabling researchers to systematically and comprehensively evaluate adversarial attacks and defenses in SR. SEC4SR incorporates 4 white-box and 2 black-box attacks, 24 defenses including our novel feature-level transformations. It also contains techniques for mounting adaptive attacks. Using SEC4SR, we conduct thus far the largest-scale empirical study on adversarial attacks and defenses in SR, involving 23 defenses, 15 attacks and 4 attack settings. Our study provides lots of useful findings that may advance future research: such as (1) all the transformations slightly degrade accuracy on benign examples and their effectiveness vary with attacks; (2) most transformations become less effective under adaptive attacks, but some transformations become more effective; (3) few transformations combined with adversarial training yield stronger defenses over some but not all attacks, while our feature-level transformation combined with adversarial training yields the strongest defense over all the attacks. Extensive experiments demonstrate capabilities and advantages of SEC4SR which can benefit future research in SR.

SEAug 9, 2021
Research on Third-Party Libraries in AndroidApps: A Taxonomy and Systematic LiteratureReview

Xian Zhan, Tianming Liu, Lingling Fan et al.

Third-party libraries (TPLs) have been widely used in mobile apps, which play an essential part in the entire Android ecosystem. However, TPL is a double-edged sword. On the one hand, it can ease the development of mobile apps. On the other hand, it also brings security risks such as privacy leaks or increased attack surfaces (e.g., by introducing over-privileged permissions) to mobile apps. Although there are already many studies for characterizing third-party libraries, including automated detection, security and privacy analysis of TPLs, TPL attributes analysis, etc., what strikes us odd is that there is no systematic study to summarize those studies' endeavors. To this end, we conduct the first systematic literature review on Android TPL-related research. Following a well-defined systematic literature review protocol, we collected 74 primary research papers closely related to the Android third-party library from 2012 to 2020. After carefully examining these studies, we designed a taxonomy of TPL-related research studies and conducted a systematic study to summarize current solutions, limitations, challenges and possible implications of new research directions related to third-party library analysis. We hope that these contributions can give readers a clear overview of existing TPL-related studies and inspire them to go beyond the current status quo by advancing the discipline with innovative approaches.

SEFeb 16, 2021
ATVHunter: Reliable Version Detection of Third-Party Libraries for Vulnerability Identification in Android Applications

Xian Zhan, Lingling Fan, Sen Chen et al.

We propose a system, named ATVHunter, which can pinpoint the precise vulnerable in-app TPL versions and provide detailed information about the vulnerabilities and TPLs. We propose a two-phase detection approach to identify specific TPL versions. Specifically, we extract the Control Flow Graphs as the coarse-grained feature to match potential TPLs in the pre-defined TPL database, and then extract opcode in each basic block of CFG as the fine-grained feature to identify the exact TPL versions. We build a comprehensive TPL database (189,545 unique TPLs with 3,006,676 versions) as the reference database. Meanwhile, to identify the vulnerable in-app TPL versions, we also construct a comprehensive and known vulnerable TPL database containing 1,180 CVEs and 224 security bugs. Experimental results show ATVHunter outperforms state-of-the-art TPL detection tools, achieving 90.55% precision and 88.79% recall with high efficiency, and is also resilient to widely-used obfuscation techniques and scalable for large-scale TPL detection. Furthermore, to investigate the ecosystem of the vulnerable TPLs used by apps, we exploit ATVHunter to conduct a large-scale analysis on 104,446 apps and find that 9,050 apps include vulnerable TPL versions with 53,337 vulnerabilities and 7,480 security bugs, most of which are with high risks and are not recognized by app developers.

CRNov 10, 2020
SeqMobile: A Sequence Based Efficient Android Malware Detection System Using RNN on Mobile Devices

Ruitao Feng, Jing Qiang Lim, Sen Chen et al.

With the proliferation of Android malware, the demand for an effective and efficient malware detection system is on the rise. The existing device-end learning based solutions tend to extract limited syntax features (e.g., permissions and API calls) to meet a certain time constraint of mobile devices. However, syntax features lack the semantics which can represent the potential malicious behaviors and further result in more robust model with high accuracy for malware detection. In this paper, we propose an efficient Android malware detection system, named SeqMobile, which adopts behavior-based sequence features and leverages customized deep neural networks on mobile devices instead of the server. Different from the traditional sequence-based approaches on server, to meet the performance demand, SeqMobile accepts three effective performance optimization methods to reduce the time cost. To evaluate the effectiveness and efficiency of our system, we conduct experiments from the following aspects 1) the detection accuracy of different recurrent neural networks; 2) the feature extraction performance on different mobile devices, 3) the detection accuracy and prediction time cost of different sequence lengths. The results unveil that SeqMobile can effectively detect malware with high accuracy. Moreover, our performance optimization methods have proven to improve the performance of training and prediction by at least twofold. Additionally, to discover the potential performance optimization from the SOTA TensorFlow model optimization toolkit for our approach, we also provide an evaluation on the toolkit, which can serve as a guidance for other systems leveraging on sequence-based learning approach. Overall, we conclude that our sequence-based approach, together with our performance optimization methods, enable us to detect malware under the performance demands of mobile devices.

SEAug 13, 2020
An Empirical Evaluation of GDPR Compliance Violations in Android mHealth Apps

Ming Fan, Le Yu, Sen Chen et al.

The purpose of the General Data Protection Regulation (GDPR) is to provide improved privacy protection. If an app controls personal data from users, it needs to be compliant with GDPR. However, GDPR lists general rules rather than exact step-by-step guidelines about how to develop an app that fulfills the requirements. Therefore, there may exist GDPR compliance violations in existing apps, which would pose severe privacy threats to app users. In this paper, we take mobile health applications (mHealth apps) as a peephole to examine the status quo of GDPR compliance in Android apps. We first propose an automated system, named \mytool, to bridge the semantic gap between the general rules of GDPR and the app implementations by identifying the data practices declared in the app privacy policy and the data relevant behaviors in the app code. Then, based on \mytool, we detect three kinds of GDPR compliance violations, including the incompleteness of privacy policy, the inconsistency of data collections, and the insecurity of data transmission. We perform an empirical evaluation of 796 mHealth apps. The results reveal that 189 (23.7\%) of them do not provide complete privacy policies. Moreover, 59 apps collect sensitive data through different measures, but 46 (77.9\%) of them contain at least one inconsistent collection behavior. Even worse, among the 59 apps, only 8 apps try to ensure the transmission security of collected data. However, all of them contain at least one encryption or SSL misuse. Our work exposes severe privacy issues to raise awareness of privacy protection for app users and developers.

CRMay 11, 2020
A Performance-Sensitive Malware Detection System Using Deep Learning on Mobile Devices

Ruitao Feng, Sen Chen, Xiaofei Xie et al.

Currently, Android malware detection is mostly performed on server side against the increasing number of malware. Powerful computing resource provides more exhaustive protection for app markets than maintaining detection by a single user. However, apart from the applications provided by the official market, apps from unofficial markets and third-party resources are always causing serious security threats to end-users. Meanwhile, it is a time-consuming task if the app is downloaded first and then uploaded to the server side for detection, because the network transmission has a lot of overhead. In addition, the uploading process also suffers from the security threats of attackers. Consequently, a last line of defense on mobile devices is necessary and much-needed. In this paper, we propose an effective Android malware detection system, MobiTive, leveraging customized deep neural networks to provide a real-time and responsive detection environment on mobile devices. MobiTive is a preinstalled solution rather than an app scanning and monitoring engine using after installation, which is more practical and secure. Original deep learning models cannot be directly deployed and executed on mobile devices due to various performance limitations, such as computation power, memory size, and energy. Therefore, we evaluate and investigate the following key points:(1) the performance of different feature extraction methods based on source code or binary code;(2) the performance of different feature type selections for deep learning on mobile devices;(3) the detection accuracy of different deep neural networks on mobile devices;(4) the real-time detection performance and accuracy on different mobile devices;(5) the potential based on the evolution trend of mobile devices' specifications; and finally we further propose a practical solution (MobiTive) to detect Android malware on mobile devices.

CRApr 24, 2020
Why an Android App is Classified as Malware? Towards Malware Classification Interpretation

Bozhi Wu, Sen Chen, Cuiyun Gao et al.

Machine learning (ML) based approach is considered as one of the most promising techniques for Android malware detection and has achieved high accuracy by leveraging commonly-used features. In practice, most of the ML classifications only provide a binary label to mobile users and app security analysts. However, stakeholders are more interested in the reason why apps are classified as malicious in both academia and industry. This belongs to the research area of interpretable ML but in a specific research domain (i.e., mobile malware detection). Although several interpretable ML methods have been exhibited to explain the final classification results in many cutting-edge Artificial Intelligent (AI) based research fields, till now, there is no study interpreting why an app is classified as malware or unveiling the domain-specific challenges. In this paper, to fill this gap, we propose a novel and interpretable ML-based approach (named XMal) to classify malware with high accuracy and explain the classification result meanwhile. (1) The first classification phase of XMal hinges multi-layer perceptron (MLP) and attention mechanism, and also pinpoints the key features most related to the classification result. (2) The second interpreting phase aims at automatically producing neural language descriptions to interpret the core malicious behaviors within apps. We evaluate the behavior description results by comparing with the existing interpretable ML-based methods (i.e., Drebin and LIME) to demonstrate the effectiveness of XMal. We find that XMal is able to reveal the malicious behaviors more accurately. Additionally, our experiments show that XMal can also interpret the reason why some samples are misclassified by ML classifiers. Our study peeks into the interpretable ML through the research of Android malware detection and analysis.

CRApr 15, 2020
Advanced Evasion Attacks and Mitigations on Practical ML-Based Phishing Website Classifiers

Yusi Lei, Sen Chen, Lingling Fan et al.

Machine learning (ML) based approaches have been the mainstream solution for anti-phishing detection. When they are deployed on the client-side, ML-based classifiers are vulnerable to evasion attacks. However, such potential threats have received relatively little attention because existing attacks destruct the functionalities or appearance of webpages and are conducted in the white-box scenario, making it less practical. Consequently, it becomes imperative to understand whether it is possible to launch evasion attacks with limited knowledge of the classifier, while preserving the functionalities and appearance. In this work, we show that even in the grey-, and black-box scenarios, evasion attacks are not only effective on practical ML-based classifiers, but can also be efficiently launched without destructing the functionalities and appearance. For this purpose, we propose three mutation-based attacks, differing in the knowledge of the target classifier, addressing a key technical challenge: automatically crafting an adversarial sample from a known phishing website in a way that can mislead classifiers. To launch attacks in the white- and grey-box scenarios, we also propose a sample-based collision attack to gain the knowledge of the target classifier. We demonstrate the effectiveness and efficiency of our evasion attacks on the state-of-the-art, Google's phishing page filter, achieved 100% attack success rate in less than one second per website. Moreover, the transferability attack on BitDefender's industrial phishing page classifier, TrafficLight, achieved up to 81.25% attack success rate. We further propose a similarity-based method to mitigate such evasion attacks, Pelican. We demonstrate that Pelican can effectively detect evasion attacks. Our findings contribute to design more robust phishing website classifiers in practice.

SEDec 6, 2019
ATOM: Commit Message Generation Based on Abstract Syntax Tree and Hybrid Ranking

Shangqing Liu, Cuiyun Gao, Sen Chen et al.

Commit messages record code changes (e.g., feature modifications and bug repairs) in natural language, and are useful for program comprehension. Due to the frequent updates of software and time cost, developers are generally unmotivated to write commit messages for code changes. Therefore, automating the message writing process is necessitated. Previous studies on commit message generation have been benefited from generation models or retrieval models, but the code structure of changed code, i.e., AST, which can be important for capturing code semantics, has not been explicitly involved. Moreover, although generation models have the advantages of synthesizing commit messages for new code changes, they are not easy to bridge the semantic gap between code and natural languages which could be mitigated by retrieval models. In this paper, we propose a novel commit message generation model, named ATOM, which explicitly incorporates the abstract syntax tree for representing code changes and integrates both retrieved and generated messages through hybrid ranking. Specifically, the hybrid ranking module can prioritize the most accurate message from both retrieved and generated messages regarding one code change. We evaluate the proposed model ATOM on our dataset crawled from 56 popular Java repositories. Experimental results demonstrate that ATOM increases the state-of-the-art models by 30.72% in terms of BLEU-4 (an accuracy measure that is widely used to evaluate text generation systems). Qualitative analysis also demonstrates the effectiveness of ATOM in generating accurate code commit messages.

LGSep 15, 2019
An Empirical Study towards Characterizing Deep Learning Development and Deployment across Different Frameworks and Platforms

Qianyu Guo, Sen Chen, Xiaofei Xie et al.

Deep Learning (DL) has recently achieved tremendous success. A variety of DL frameworks and platforms play a key role to catalyze such progress. However, the differences in architecture designs and implementations of existing frameworks and platforms bring new challenges for DL software development and deployment. Till now, there is no study on how various mainstream frameworks and platforms influence both DL software development and deployment in practice. To fill this gap, we take the first step towards understanding how the most widely-used DL frameworks and platforms support the DL software development and deployment. We conduct a systematic study on these frameworks and platforms by using two types of DNN architectures and three popular datasets. (1) For development process, we investigate the prediction accuracy under the same runtime training configuration or same model weights/biases. We also study the adversarial robustness of trained models by leveraging the existing adversarial attack techniques. The experimental results show that the computing differences across frameworks could result in an obvious prediction accuracy decline, which should draw the attention of DL developers. (2) For deployment process, we investigate the prediction accuracy and performance (refers to time cost and memory consumption) when the trained models are migrated/quantized from PC to real mobile devices and web browsers. The DL platform study unveils that the migration and quantization still suffer from compatibility and reliability issues. Meanwhile, we find several DL software bugs by using the results as a benchmark. We further validate the results through bug confirmation from stakeholders and industrial positive feedback to highlight the implications of our study. Through our study, we summarize practical guidelines, identify challenges and pinpoint new research directions.

CRFeb 2, 2019
A Large-Scale Empirical Study on Industrial Fake Apps

Chongbin Tang, Sen Chen, Lingling Fan et al.

While there have been various studies towards Android apps and their development, there is limited discussion of the broader class of apps that fall in the fake area. Fake apps and their development are distinct from official apps and belong to the mobile underground industry. Due to the lack of knowledge of the mobile underground industry, fake apps, their ecosystem and nature still remain in mystery. To fill the blank, we conduct the first systematic and comprehensive empirical study on a large-scale set of fake apps. Over 150,000 samples related to the top 50 popular apps are collected for extensive measurement. In this paper, we present discoveries from three different perspectives, namely fake sample characteristics, quantitative study on fake samples and fake authors' developing trend. Moreover, valuable domain knowledge, like fake apps' naming tendency and fake developers' evasive strategies, is then presented and confirmed with case studies, demonstrating a clear vision of fake apps and their ecosystem.

SEFeb 1, 2019
StoryDroid: Automated Generation of Storyboard for Android Apps

Sen Chen, Lingling Fan, Chunyang Chen et al.

Mobile apps are now ubiquitous. Before developing a new app, the development team usually endeavors painstaking efforts to review many existing apps with similar purposes. The review process is crucial in the sense that it reduces market risks and provides inspiration for app development. However, manual exploration of hundreds of existing apps by different roles (e.g., product manager, UI/UX designer, developer) in a development team can be ineffective. For example, it is difficult to completely explore all the functionalities of the app in a short period of time. Inspired by the conception of storyboard in movie production, we propose a system, StoryDroid, to automatically generate the storyboard for Android apps, and assist different roles to review apps efficiently. Specifically, StoryDroid extracts the activity transition graph and leverages static analysis techniques to render UI pages to visualize the storyboard with the rendered pages. The mapping relations between UI pages and the corresponding implementation code (e.g., layout code, activity code, and method hierarchy) are also provided to users. Our comprehensive experiments unveil that StoryDroid is effective and indeed useful to assist app development. The outputs of StoryDroid enable several potential applications, such as the recommendation of UI design and layout code.

SEOct 10, 2018
Secure Deep Learning Engineering: A Software Quality Assurance Perspective

Lei Ma, Felix Juefei-Xu, Minhui Xue et al.

Over the past decades, deep learning (DL) systems have achieved tremendous success and gained great popularity in various applications, such as intelligent machines, image processing, speech processing, and medical diagnostics. Deep neural networks are the key driving force behind its recent success, but still seem to be a magic black box lacking interpretability and understanding. This brings up many open safety and security issues with enormous and urgent demands on rigorous methodologies and engineering practice for quality enhancement. A plethora of studies have shown that the state-of-the-art DL systems suffer from defects and vulnerabilities that can lead to severe loss and tragedies, especially when applied to real-world safety-critical applications. In this paper, we perform a large-scale study and construct a paper repository of 223 relevant works to the quality assurance, security, and interpretation of deep learning. We, from a software quality assurance perspective, pinpoint challenges and future opportunities towards universal secure deep learning engineering. We hope this work and the accompanied paper repository can pave the path for the software engineering community towards addressing the pressing industrial demand of secure intelligent applications.

SEAug 9, 2018
Efficiently Manifesting Asynchronous Programming Errors in Android Apps

Lingling Fan, Ting Su, Sen Chen et al.

Android, the #1 mobile app framework, enforces the single-GUI-thread model, in which a single UI thread manages GUI rendering and event dispatching. Due to this model, it is vital to avoid blocking the UI thread for responsiveness. One common practice is to offload long-running tasks into async threads. To achieve this, Android provides various async programming constructs, and leaves developers themselves to obey the rules implied by the model. However, as our study reveals, more than 25% apps violate these rules and introduce hard-to-detect, fail-stop errors, which we term as aysnc programming errors (APEs). To this end, this paper introduces APEChecker, a technique to automatically and efficiently manifest APEs. The key idea is to characterize APEs as specific fault patterns, and synergistically combine static analysis and dynamic UI exploration to detect and verify such errors. Among the 40 real-world Android apps, APEChecker unveils and processes 61 APEs, of which 51 are confirmed (83.6% hit rate). Specifically, APEChecker detects 3X more APEs than the state-of-art testing tools (Monkey, Sapienz and Stoat), and reduces testing time from half an hour to a few minutes. On a specific type of APEs, APEChecker confirms 5X more errors than the data race detection tool, EventRacer, with very few false alarms.

CRMay 14, 2018
An Empirical Assessment of Security Risks of Global Android Banking Apps

Sen Chen, Lingling Fan, Guozhu Meng et al.

Mobile banking apps, belonging to the most security-critical app category, render massive and dynamic transactions susceptible to security risks. Given huge potential financial loss caused by vulnerabilities, existing research lacks a comprehensive empirical study on the security risks of global banking apps to provide useful insights and improve the security of banking apps. Since data-related weaknesses in banking apps are critical and may directly cause serious financial loss, this paper first revisits the state-of-the-art available tools and finds that they have limited capability in identifying data-related security weaknesses of banking apps. To complement the capability of existing tools in data-related weakness detection, we propose a three-phase automated security risk assessment system, named AUSERA, which leverages static program analysis techniques and sensitive keyword identification. By leveraging AUSERA, we collect 2,157 weaknesses in 693 real-world banking apps across 83 countries, which we use as a basis to conduct a comprehensive empirical study from different aspects, such as global distribution and weakness evolution during version updates. We find that apps owned by subsidiary banks are always less secure than or equivalent to those owned by parent banks. In addition, we also track the patching of weaknesses and receive much positive feedback from banking entities so as to improve the security of banking apps in practice. To date, we highlight that 21 banks have confirmed the weaknesses we reported. We also exchange insights with 7 banks, such as HSBC in UK and OCBC in Singapore, via in-person or online meetings to help them improve their apps. We hope that the insights developed in this paper will inform the communities about the gaps among multiple stakeholders, including banks, academic researchers, and third-party security companies.

CRJun 13, 2017
Automated Poisoning Attacks and Defenses in Malware Detection Systems: An Adversarial Machine Learning Approach

Sen Chen, Minhui Xue, Lingling Fan et al.

The evolution of mobile malware poses a serious threat to smartphone security. Today, sophisticated attackers can adapt by maximally sabotaging machine-learning classifiers via polluting training data, rendering most recent machine learning-based malware detection tools (such as Drebin, DroidAPIMiner, and MaMaDroid) ineffective. In this paper, we explore the feasibility of constructing crafted malware samples; examine how machine-learning classifiers can be misled under three different threat models; then conclude that injecting carefully crafted data into training data can significantly reduce detection accuracy. To tackle the problem, we propose KuafuDet, a two-phase learning enhancing approach that learns mobile malware by adversarial detection. KuafuDet includes an offline training phase that selects and extracts features from the training set, and an online detection phase that utilizes the classifier trained by the first phase. To further address the adversarial environment, these two phases are intertwined through a self-adaptive learning scheme, wherein an automated camouflage detector is introduced to filter the suspicious false negatives and feed them back into the training phase. We finally show that KuafuDet can significantly reduce false negatives and boost the detection accuracy by at least 15%. Experiments on more than 250,000 mobile applications demonstrate that KuafuDet is scalable and can be highly effective as a standalone system.