CRAI Nov 27, 2025

Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection

arXiv:2511.22095v1 (Has Code)
Originality: Synthesis-oriented
AI Analysis

Binary-30K gives researchers, practitioners, and students an accessible dataset for binary analysis and malware detection. The contribution is incremental: it fills an infrastructure gap rather than proposing a novel method.

The authors address the lack of a heterogeneous dataset for deep learning in binary analysis by introducing Binary-30K, a collection of 29,793 binaries spanning multiple platforms and architectures that supports research on malware detection and cross-target transfer learning.

Deep learning research for binary analysis faces a critical infrastructure gap. Today, existing datasets target single platforms, require specialized tooling, or provide only hand-engineered features incompatible with modern neural architectures; no single dataset supports accessible research and pedagogy on realistic use cases. To solve this, we introduce Binary-30K, the first heterogeneous binary dataset designed for sequence-based models like transformers. Critically, Binary-30K covers Windows, Linux, macOS, and Android across 15+ CPU architectures. With 29,793 binaries and approximately 26.93% malware representation, Binary-30K enables research on platform-invariant detection, cross-target transfer learning, and long-context binary understanding. The dataset provides pre-computed byte-level BPE tokenization alongside comprehensive structural metadata, supporting both sequence modeling and structure-aware approaches. Platform-first stratified sampling ensures representative coverage across operating systems and architectures, while distribution via Hugging Face with official train/validation/test splits enables reproducible benchmarking. The dataset is publicly available at https://huggingface.co/datasets/mjbommar/binary-30k, providing an accessible resource for researchers, practitioners, and students alike.
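The abstract mentions platform-first stratified sampling and fixed train/validation/test splits but does not spell out the procedure. As a rough illustration only, the sketch below stratifies toy records on a composite (platform, architecture) key so that every stratum appears in every split. The field names `platform` and `arch`, the 80/10/10 fractions, and the function name `stratified_split` are assumptions made for this example, not details taken from the paper or the dataset.

```python
import random
from collections import defaultdict


def stratified_split(records, key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/validation/test, one stratum at a time.

    `key` maps a record to its stratum (here: platform first, then
    architecture). The fractions are illustrative assumptions, not the
    proportions used by Binary-30K.
    """
    rng = random.Random(seed)

    # Group records by stratum so each one is split independently.
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)

    splits = {"train": [], "validation": [], "test": []}
    for stratum in sorted(groups):  # sorted for a deterministic order
        items = groups[stratum]
        rng.shuffle(items)
        n_train = int(len(items) * fractions[0])
        n_val = int(len(items) * fractions[1])
        splits["train"].extend(items[:n_train])
        splits["validation"].extend(items[n_train:n_train + n_val])
        # Whatever remains after train and validation goes to test.
        splits["test"].extend(items[n_train + n_val:])
    return splits


# Toy records with hypothetical fields; the real dataset schema may differ.
toy_records = [
    {"platform": platform, "arch": arch, "id": i}
    for platform, arch in [("linux", "x86_64"), ("windows", "x86"), ("android", "arm64")]
    for i in range(10)
]
toy_splits = stratified_split(toy_records, key=lambda r: (r["platform"], r["arch"]))
```

Splitting within each stratum, rather than over the pooled records, keeps small platform and architecture combinations from vanishing out of the validation or test sets, which is the property the abstract attributes to stratified sampling.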

