CRMar 14, 2022
Toward the Detection of Polyglot FilesLuke Koch, Sean Oesch, Mary Adkisson et al.
Standardized file formats play a key role in the development and use of computer software. However, it is possible to abuse standardized file formats by creating a file that is valid in multiple file formats. The resulting polyglot (many languages) file can confound file format identification, allowing elements of the file to evade analysis.This is especially problematic for malware detection systems that rely on file format identification for feature extraction. File format identification processes that depend on file signatures can be easily evaded thanks to flexibility in the format specifications of certain file formats. Although work has been done to identify file formats using more comprehensive methods than file signatures, accurate identification of polyglot files remains an open problem. Since malware detection systems routinely perform file format-specific feature extraction, polyglot files need to be filtered out prior to ingestion by these systems. Otherwise, malicious content could pass through undetected. To address the problem of polyglot detection we assembled a data set using the mitra tool. We then evaluated the performance of the most commonly used file identification tool, file. Finally, we demonstrated the accuracy, precision, recall and F1 score of a range of machine and deep learning models. Malconv2 and Catboost demonstrated the highest recall on our data set with 95.16% and 95.45%, respectively. These models can be incorporated into a malware detector's file processing pipeline to filter out potentially malicious polyglots before file format-dependent feature extraction takes place.
CRJul 1, 2024
On the Abuse and Detection of Polyglot FilesLuke Koch, Sean Oesch, Amul Chaulagain et al.
A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures, as well as file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding $30$ polyglot samples and $15$ attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose cyber attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild -- the first of its kind -- we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of $0.999$ with an F1 score of $99.20$% for polyglot detection and $99.47$% for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized $100$% of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.
AIOct 10, 2025Code
FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented GenerationSamuel Hildebrand, Curtis Taylor, Sean Oesch et al.
Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, evaluating a pipeline's ability to ingest, retrieve, and reason about several modalities of information, differentiating it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline's ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines in both correctness and hallucination metrics, with wider performance gaps in questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating "strongly agree").
CYFeb 10, 2025
Agentic AI and the Cyber Arms RaceSean Oesch, Jack Hutchins, Phillipe Austria et al.
Agentic AI is shifting the cybersecurity landscape as attackers and defenders leverage AI agents to augment humans and automate common tasks. In this article, we examine the implications for cyber warfare and global politics as Agentic AI becomes more powerful and enables the broad proliferation of capabilities only available to the most well resourced actors today.
CRAug 12, 2025
Attacks and Defenses Against LLM FingerprintingKevin Kurian, Ethan Holland, Sean Oesch
As large language models are increasingly deployed in sensitive environments, fingerprinting attacks pose significant privacy and security risks. We present a study of LLM fingerprinting from both offensive and defensive perspectives. Our attack methodology uses reinforcement learning to automatically optimize query selection, achieving better fingerprinting accuracy with only 3 queries compared to randomly selecting 3 queries from the same pool. Our defensive approach employs semantic-preserving output filtering through a secondary LLM to obfuscate model identity while maintaining semantic integrity. The defensive method reduces fingerprinting accuracy across tested models while preserving output quality. These contributions show the potential to improve fingerprinting tools capabilities while providing practical mitigation strategies against fingerprinting attacks.
CRApr 12, 2024
The Path To Autonomous Cyber DefenseSean Oesch, Phillipe Austria, Amul Chaulagain et al.
Defenders are overwhelmed by the number and scale of attacks against their networks.This problem will only be exacerbated as attackers leverage artificial intelligence to automate their workflows. We propose a path to autonomous cyber agents able to augment defenders by automating critical steps in the cyber defense life cycle.
CROct 13, 2025
Living Off the LLM: How LLMs Will Change Adversary TacticsSean Oesch, Jack Hutchins, Luke Koch et al.
In living off the land attacks, malicious actors use legitimate tools and processes already present on a system to avoid detection. In this paper, we explore how the on-device LLMs of the future will become a security concern as threat actors integrate LLMs into their living off the land attack pipeline and ways the security community may mitigate this threat.
HCNov 30, 2021
A Mathematical Framework for Evaluation of SOAR Tools with Limited Survey DataSavannah Norem, Ashley E Rice, Samantha Erwin et al.
Security operation centers (SOCs) all over the world are tasked with reacting to cybersecurity alerts ranging in severity. Security Orchestration, Automation, and Response (SOAR) tools streamline cybersecurity alert responses by SOC operators. SOAR tool adoption is expensive both in effort and finances. Hence, it is crucial to limit adoption to those most worthwhile; yet no research evaluating or comparing SOAR tools exists. The goal of this work is to evaluate several SOAR tools using specific criteria pertaining to their usability. SOC operators were asked to first complete a survey about what SOAR tool aspects are most important. Operators were then assigned a set of SOAR tools for which they viewed demonstration and overview videos, and then operators completed a second survey wherein they were tasked with evaluating each of the tools on the aspects from the first survey. In addition, operators provided an overall rating to each of their assigned tools, and provided a ranking of their tools in order of preference. Due to time constraints on SOC operators for thorough testing, we provide a systematic method of downselecting a large pool of SOAR tools to a select few that merit next-step hands-on evaluation by SOC operators. Furthermore, the analyses conducted in this survey help to inform future development of SOAR tools to ensure that the appropriate functions are available for use in a SOC.
CRMay 13, 2021
What Clinical Trials Can Teach Us about the Development of More Resilient AI for CybersecurityEdmon Begoli, Robert A. Bridges, Sean Oesch et al.
Policy-mandated, rigorously administered scientific testing is needed to provide transparency into the efficacy of artificial intelligence-based (AI-based) cyber defense tools for consumers and to prioritize future research and development. In this article, we propose a model that is informed by our experience, urged forward by massive scale cyberattacks, and inspired by parallel developments in the biomedical field and the unprecedentedly fast development of new vaccines to combat global pathogens.
CRApr 20, 2021
The Emperor's New Autofill Framework: A Security Analysis of Autofill on iOS and AndroidSean Oesch, Anuj Gautam, Scott Ruoti
Password managers help users more effectively manage their passwords, encouraging them to adopt stronger passwords across their many accounts. In contrast to desktop systems where password managers receive no system-level support, mobile operating systems provide autofill frameworks designed to integrate with password managers to provide secure and usable autofill for browsers and other apps installed on mobile devices. In this paper, we evaluate mobile autofill frameworks on iOS and Android, examining whether they achieve substantive benefits over the ad-hoc desktop environment or become a problematic single point of failure. Our results find that while the frameworks address several common issues, they also enforce insecure behavior and fail to provide password managers sufficient information to override the frameworks' insecure behavior, resulting in mobile managers being less secure than their desktop counterparts overall. We also demonstrate how these frameworks act as a confused deputy in manager-assisted credential phishing attacks. Our results demonstrate the need for significant improvements to mobile autofill frameworks. We conclude the paper with recommendations for the design and implementation of secure autofill frameworks.
HCDec 16, 2020
An Integrated Platform for Collaborative Data AnalyticsSean Oesch, Rob Gillen, Tom Karnowski
While collaboration among data scientists is a key to organizational productivity, data analysts face significant barriers to achieving this end, including data sharing, accessing and configuring the required computational environment, and a unified method of sharing knowledge. Each of these barriers to collaboration is related to the fundamental question of knowledge management "how can organizations use knowledge more effectively?". In this paper, we consider the problem of knowledge management in collaborative data analytics and present ShareAL, an integrated knowledge management platform, as a solution to that problem. The ShareAL platform consists of three core components: a full stack web application, a dashboard for analyzing streaming data and a High Performance Computing (HPC) cluster for performing real time analysis. Prior research has not applied knowledge management to collaborative analytics or developed a platform with the same capabilities as ShareAL. ShareAL overcomes the barriers data scientists face to collaboration by providing intuitive sharing of data and analytics via the web application, a shared computing environment via the HPC cluster and knowledge sharing and collaboration via a real time messaging application.
CRDec 16, 2020
Beyond the Hype: A Real-World Evaluation of the Impact and Cost of Machine Learning-Based Malware DetectionRobert A. Bridges, Sean Oesch, Miki E. Verma et al.
In this paper, we present a scientific evaluation of four prominent malware detection tools to assist an organization with two primary questions: To what extent do ML-based tools accurately classify previously- and never-before-seen files? Is it worth purchasing a network-level malware detector? To identify weaknesses, we tested each tool against 3,536 total files (2,554 or 72\% malicious, 982 or 28\% benign) of a variety of file types, including hundreds of malicious zero-days, polyglots, and APT-style files, delivered on multiple protocols. We present statistical results on detection time and accuracy, consider complementary analysis (using multiple tools together), and provide two novel applications of the recent cost-benefit evaluation procedure of Iannacone \& Bridges. While the ML-based tools are more effective at detecting zero-day files and executables, the signature-based tool may still be an overall better option. Both network-based tools provide substantial (simulated) savings when paired with either host tool, yet both show poor detection rates on protocols other than HTTP or SMTP. Our results show that all four tools have near-perfect precision but alarmingly low recall, especially on file types other than executables and office files -- 37% of malware tested, including all polyglot files, were undetected. Priorities for researchers and takeaways for end users are given.
HCDec 16, 2020
An Assessment of the Usability of Machine Learning Based Tools for the Security Operations CenterSean Oesch, Robert Bridges, Jared Smith et al.
Gartner, a large research and advisory company, anticipates that by 2024 80% of security operation centers (SOCs) will use machine learning (ML) based solutions to enhance their operations. In light of such widespread adoption, it is vital for the research community to identify and address usability concerns. This work presents the results of the first in situ usability assessment of ML-based tools. With the support of the US Navy, we leveraged the national cyber range, a large, air-gapped cyber testbed equipped with state-of-the-art network and user emulation capabilities, to study six US Naval SOC analysts' usage of two tools. Our analysis identified several serious usability issues, including multiple violations of established usability heuristics form user interface design. We also discovered that analysts lacked a clear mental model of how these tools generate scores, resulting in mistrust and/or misuse of the tools themselves. Surprisingly, we found no correlation between analysts' level of education or years of experience and their performance with either tool, suggesting that other factors such as prior background knowledge or personality play a significant role in ML-based tool usage. Our findings demonstrate that ML-based security tool vendors must put a renewed focus on working with analysts, both experienced and inexperienced, to ensure that their systems are usable and useful in real-world security operations settings.
CRAug 9, 2019
That Was Then, This Is Now: A Security Evaluation of Password Generation, Storage, and Autofill in Thirteen Password ManagersSean Oesch, Scott Ruoti
Password managers have the potential to help users more effectively manage their passwords and address many of the concerns surrounding password-based authentication, however prior research has identified significant vulnerabilities in existing password managers. Since that time, five years has passed, leaving it unclear whether password managers remain vulnerable or whether they are now ready for broad adoption. To answer this question, we evaluate thirteen popular password managers and consider all three stages of the password manager lifecycle--password generation, storage, and autofill. Our evaluation is the first analysis of password generation in password managers, finding several non-random character distributions and identifying instances where generated passwords were vulnerable to online and offline guessing attacks. For password storage and autofill, we replicate past evaluations, demonstrating that while password managers have improved in the half-decade since those prior evaluations, there are still significant issues, particularly with browser-based password managers; these problems include unencrypted metadata, unsafe defaults, and vulnerabilities to clickjacking attacks. Based on our results, we identify password managers to avoid, provide recommendations on how to improve existing password managers, and identify areas of future research.