CR · May 29

When Entropy Is Not Enough: Multi-Modal Classification of Encrypted and Compressed Data Fragments

arXiv:2605.31337
AI Analysis

This work makes the identification of encrypted data fragments more reliable, which matters for cybersecurity applications such as ransomware detection and digital forensics, particularly when only small fragments are available.

This paper addresses the challenge of distinguishing encrypted from compressed data fragments, especially small ones (512 to 2048 bytes), where traditional methods struggle. The proposed multi-modal ensemble architecture, Triumvir, integrates statistical, sequential, and spatial representations, achieving gains of up to +4.5pp in binary and +6.4pp in multiclass classification over state-of-the-art methods.

Reliable identification of encrypted data fragments is essential in cybersecurity, with applications to ransomware detection, digital forensics, and large-scale data analysis. Distinguishing encrypted from compressed fragments is particularly challenging, as short fragments lack structural data and exhibit low statistical redundancy. Traditional statistical methods based on byte-level distributions show limited effectiveness on this task. Recent machine learning approaches improve performance by learning subtle patterns from raw bytes, but predominantly rely on single-modal representations, implicitly assuming that a single view of the data is sufficient for accurate classification. This paper shows that this assumption becomes a fundamental limitation in low-information settings, when only small fragments of data are available (512–2048 bytes). We propose Triumvir, a multi-modal, uncertainty-aware ensemble architecture that integrates statistical, sequential, and spatial representations of raw byte fragments. Extensive experimental analysis demonstrates that Triumvir consistently outperforms state-of-the-art methods with gains of up to +4.5pp in binary and +6.4pp in multiclass classification. Ablation studies confirm that combining modalities is critical, yielding improvements of up to +5pp over partial configurations.
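
To make the three "views" of a fragment concrete, here is a minimal Python sketch. It is not the Triumvir implementation, and the abstract does not specify how each representation is built. The function names, the 256-bin histogram, the 32-byte row width, and the use of pseudo-random bytes as a stand-in for ciphertext are all assumptions chosen for illustration. The sketch also computes Shannon entropy for a compressed fragment and a random fragment, so the two values can be compared directly.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    counts = Counter(fragment)
    n = len(fragment)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def statistical_view(fragment: bytes) -> list[float]:
    """Hypothetical statistical view: normalized 256-bin byte histogram."""
    counts = Counter(fragment)
    n = len(fragment)
    return [counts.get(value, 0) / n for value in range(256)]


def sequential_view(fragment: bytes) -> list[int]:
    """Hypothetical sequential view: byte values in their original order."""
    return list(fragment)


def spatial_view(fragment: bytes, width: int = 32) -> list[list[int]]:
    """Hypothetical spatial view: bytes laid out row by row as a 2D grid."""
    rows = len(fragment) // width
    return [list(fragment[r * width:(r + 1) * width]) for r in range(rows)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Assumption: pseudo-random bytes stand in for an encrypted fragment.
encrypted_like = bytes(rng.randrange(256) for _ in range(1024))

# A compressed fragment: zlib output over word-like text, sliced from the
# middle of the stream so the zlib header is not included.
words = [b"alpha", b"beta", b"gamma", b"delta",
         b"epsilon", b"zeta", b"eta", b"theta"]
source = b" ".join(rng.choice(words) for _ in range(20000))
compressed = zlib.compress(source, 9)[1024:2048]

if __name__ == "__main__":
    for name, frag in [("encrypted-like", encrypted_like),
                       ("compressed", compressed)]:
        grid = spatial_view(frag)
        print(name,
              f"entropy={shannon_entropy(frag):.3f} bits/byte,",
              f"histogram bins={len(statistical_view(frag))},",
              f"sequence length={len(sequential_view(frag))},",
              f"grid={len(grid)}x{len(grid[0])}")
```

Each view would feed a different kind of model in a multi-modal setup; how Triumvir actually encodes and combines them, and how it handles uncertainty, is described in the paper itself rather than here.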
