CR | AI | Mar 4

CAM-LDS: Cyber Attack Manifestations for Automatic Interpretation of System Logs and Security Alerts

arXiv:2603.04186v1 | h-index: 29 | Has Code
Originality: Incremental advance
AI Analysis

This dataset addresses the scarcity of publicly available, labeled data for training and evaluating automated log analysis methods, particularly those that leverage Large Language Models. That scarcity hinders both cybersecurity researchers and practitioners.

The authors introduce the Cyber Attack Manifestation Log Data Set (CAM-LDS), which includes seven attack scenarios covering 81 distinct techniques across 13 tactics, collected from 18 sources in an open-source environment. In a case study, an LLM applied to the data set predicts the correct attack techniques perfectly for approximately one third of attack steps and adequately for another third.

Log data are essential for intrusion detection and forensic investigations. However, manual log analysis is tedious due to high data volumes, heterogeneous event formats, and unstructured messages. Even though many automated methods for log analysis exist, they usually still rely on domain-specific configurations such as expert-defined detection rules, handcrafted log parsers, or manual feature engineering. Crucially, the level of automation of conventional methods is limited due to their inability to semantically understand logs and explain their underlying causes. In contrast, Large Language Models enable domain- and format-agnostic interpretation of system logs and security alerts. Unfortunately, research on this topic remains challenging, because publicly available and labeled data sets covering a broad range of attack techniques are scarce. To address this gap, we introduce the Cyber Attack Manifestation Log Data Set (CAM-LDS), comprising seven attack scenarios that cover 81 distinct techniques across 13 tactics and were collected from 18 distinct sources within a fully open-source and reproducible test environment. We extract log events that directly result from attack executions to facilitate analysis of manifestations concerning command observability, event frequencies, performance metrics, and intrusion detection alerts. We further present an illustrative case study utilizing an LLM to process the CAM-LDS. The results indicate that correct attack techniques are predicted perfectly for approximately one third of attack steps and adequately for another third, highlighting the potential of LLM-based log interpretation and the utility of our data set.
