LGFeb 2, 2024Code
TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time (Extended Version)Zeliang Kan, Shae McFadden, Daniel Arp et al.
Machine learning (ML) plays a pivotal role in detecting malicious software. Despite the high F1-scores reported in numerous studies reaching upwards of 0.99, the issue is not completely solved. Malware detectors often experience performance decay due to constantly evolving operating systems and attack methods, which can render previously learned knowledge insufficient for accurate decision-making on new inputs. This paper argues that commonly reported results are inflated due to two pervasive sources of experimental bias in the detection task: spatial bias caused by data distributions that are not representative of a real-world deployment; and temporal bias caused by incorrect time splits of data, leading to unrealistic configurations. To address these biases, we introduce a set of constraints for fair experiment design, and propose a new metric, AUT, for classifier robustness in real-world settings. We additionally propose an algorithm designed to tune training data to enhance classifier performance. Finally, we present TESSERACT, an open-source framework for realistic classifier comparison. Our evaluation encompasses both traditional ML and deep learning methods, examining published works on an extensive Android dataset with 259,230 samples over a five-year span. Additionally, we conduct case studies in the Windows PE and PDF domains. Our findings identify the existence of biases in previous studies and reveal that significant performance enhancements are possible through appropriate, periodic tuning. We explore how mitigation strategies may support in achieving a more stable and better performance over time by employing multiple strategies to delay performance decay.
LGSep 12, 2024
BLens: Contrastive Captioning of Binary Functions using Ensemble EmbeddingTristan Benoit, Yunru Wang, Moritz Dannehl et al.
Function names can greatly aid human reverse engineers, which has spurred the development of machine learning-based approaches to predicting function names in stripped binaries. Much current work in this area now uses transformers, applying a metaphor of machine translation from code to function names. Still, function naming models face challenges in generalizing to projects unrelated to the training set. In this paper, we take a completely new approach by transferring advances in automated image captioning to the domain of binary reverse engineering, such that different parts of a binary function can be associated with parts of its name. We propose BLens, which combines multiple binary function embeddings into a new ensemble representation, aligns it with the name representation latent space via a contrastive learning approach, and generates function names with a transformer architecture tailored for function names. Our experiments demonstrate that BLens significantly outperforms the state of the art. In the usual setting of splitting per binary, we achieve an $F_1$ score of 0.79 compared to 0.70. In the cross-project setting, which emphasizes generalizability, we achieve an $F_1$ score of 0.46 compared to 0.29. Finally, in an experimental setting reducing shared components across projects, we achieve an $F_1$ score of $0.32$ compared to $0.19$.
CRJul 20, 2018Code
TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and TimeFeargus Pendlebury, Fabio Pierazzi, Roberto Jordaney et al.
Is Android malware classification a solved problem? Published F1 scores of up to 0.99 appear to leave very little room for improvement. In this paper, we argue that results are commonly inflated due to two pervasive sources of experimental bias: "spatial bias" caused by distributions of training and testing data that are not representative of a real-world deployment; and "temporal bias" caused by incorrect time splits of training and testing sets, leading to impossible configurations. We propose a set of space and time constraints for experiment design that eliminates both sources of bias. We introduce a new metric that summarizes the expected robustness of a classifier in a real-world setting, and we present an algorithm to tune its performance. Finally, we demonstrate how this allows us to evaluate mitigation strategies for time decay such as active learning. We have implemented our solutions in TESSERACT, an open source evaluation framework for comparing malware classifiers in a realistic setting. We used TESSERACT to evaluate three Android malware classifiers from the literature on a dataset of 129K applications spanning over three years. Our evaluation confirms that earlier published results are biased, while also revealing counter-intuitive performance and showing that appropriate tuning can lead to significant improvements.
CRAug 31, 2021
Cats vs. Spectre: An Axiomatic Approach to Modeling Speculative Execution AttacksHernán Ponce-de-León, Johannes Kinder
The Spectre family of speculative execution attacks have required a rethinking of formal methods for security. Approaches based on operational speculative semantics have made initial inroads towards finding vulnerable code and validating defenses. However, with each new attack grows the amount of microarchitectural detail that has to be integrated into the underlying semantics. We propose an alternative, light-weight and axiomatic approach to specifying speculative semantics that relies on insights from memory models for concurrency. We use the CAT modeling language for memory consistency to specify execution models that capture speculative control flow, store-to-load forwarding, predictive store forwarding, and memory ordering machine clears. We present a bounded model checking framework parametrized by our speculative CAT models and evaluate its implementation against the state of the art. Due to the axiomatic approach, our models can be rapidly extended to allow our framework to detect new types of attacks and validate defenses against them.
SEAug 16, 2021
Systematic Generation of Conformance Tests for JavaScriptBlake Loring, Johannes Kinder
JavaScript implementations are tested for conformance to the ECMAScript standard using a large hand-written test suite. Not only in this a tedious approach, it also relies solely on the natural language specification for differentiating behaviors, while hidden implementation details can also affect behavior and introduce divergences. We propose to generate conformance tests through dynamic symbolic execution of polyfills, drop-in replacements for newer JavaScript language features that are not yet widely supported. We then run these generated tests against multiple implementations of JavaScript, using a majority vote to identify the correct behavior. To facilitate test generation for polyfill code, we introduce a model for structured symbolic inputs that is suited to the dynamic nature of JavaScript. In our evaluation, we found 17 divergences in the widely used core-js polyfill and were able to increase branch coverage in interpreter code by up to 15%. Because polyfills are typically written even before standardization, our approach will allow to maintain and extend standardization test suites with reduced effort.
CRJul 28, 2021
XFL: Naming Functions in Binaries with Extreme Multi-label LearningJames Patrick-Evans, Moritz Dannehl, Johannes Kinder
Reverse engineers benefit from the presence of identifiers such as function names in a binary, but usually these are removed for release. Training a machine learning model to predict function names automatically is promising but fundamentally hard: unlike words in natural language, most function names occur only once. In this paper, we address this problem by introducing eXtreme Function Labeling (XFL), an extreme multi-label learning approach to selecting appropriate labels for binary functions. XFL splits function names into tokens, treating each as an informative label akin to the problem of tagging texts in natural language. We relate the semantics of binary code to labels through DEXTER, a novel function embedding that combines static analysis-based features with local context from the call graph and global context from the entire binary. We demonstrate that XFL/DEXTER outperforms the state of the art in function labeling on a dataset of 10,047 binaries from the Debian project, achieving a precision of 83.5%. We also study combinations of XFL with alternative binary embeddings from the literature and show that DEXTER consistently performs best for this task. As a result, we demonstrate that binary function labeling can be effectively phrased in terms of multi-label learning, and that binary function embeddings benefit from including explicit semantic features.
CRSep 17, 2017
BabelView: Evaluating the Impact of Code Injection Attacks in Mobile WebviewsClaudio Rizzo, Lorenzo Cavallaro, Johannes Kinder
A Webview embeds a full-fledged browser in a mobile application and allows the application to expose a custom interface to JavaScript code. This is a popular technique to build so-called hybrid applications, but it circumvents the usual security model of the browser: any malicious JavaScript code injected into the Webview gains access to the interface and can use it to manipulate the device or exfiltrate sensitive data. In this paper, we present an approach to systematically evaluate the possible impact of code injection attacks against Webviews using static information flow analysis. Our key idea is that we can make reasoning about JavaScript semantics unnecessary by instrumenting the application with a model of possible attacker behavior -- the BabelView. We evaluate our approach on 11,648 apps from various Android marketplaces, finding 2,677 vulnerabilities in 1,663 apps. Taken together, the apps reported as vulnerable have over 835 million installations worldwide. We manually validated a random sample of 66 apps and estimate that our fully automated analysis achieves a precision of 90% at a recall of 66%.