CR, LG · Jul 24, 2020

Detecting malicious PDF using CNN

arXiv:2007.12729v2
AI Analysis

The paper addresses a critical security threat to computer systems by automating malware detection, though the contribution is incremental: it applies existing CNN methods to a specific domain.

The paper tackles the problem of detecting malicious PDF files by proposing an ensemble of Convolutional Neural Networks (CNNs) that operates on byte-level data without handcrafted features, achieving a 94% detection rate on a dataset of 90,000 files and identifying new malware undetected by most antiviruses.

Malicious PDF files represent one of the biggest threats to computer security. To detect them, significant research has been done using handwritten signatures or machine learning based on manual feature extraction. Both approaches are time-consuming and require significant prior knowledge, and the list of features has to be updated with each newly discovered vulnerability. In this work, we propose a novel algorithm that uses an ensemble of Convolutional Neural Networks (CNNs) on the byte level of the file, without any handcrafted features. We show, using a data set of 90,000 files downloadable online, that our approach maintains a high detection rate (94%) of PDF malware and even detects new malicious files that are still undetected by most antiviruses. Using features automatically generated by our CNNs, and applying a clustering algorithm, we also obtain high similarity between the antiviruses' labels and the resulting clusters.
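The abstract's two central ideas, byte-level input with no handcrafted features and an ensemble of classifiers, can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed input length, the padding id, the score-averaging rule, and the 0.5 threshold are all assumptions made for illustration, and the "models" are hypothetical stand-ins for trained CNNs.

```python
from typing import Callable, List, Sequence

PAD_ID = 256   # assumed padding id, one past the largest byte value (255)
MAX_LEN = 16   # assumed fixed input length; tiny here, for illustration only

# A "model" is anything that maps a sequence of byte ids to a malicious
# probability. In the paper these would be trained CNNs; here they are
# hypothetical placeholders.
Model = Callable[[Sequence[int]], float]


def bytes_to_ids(data: bytes, max_len: int = MAX_LEN) -> List[int]:
    """Turn raw file bytes into a fixed-length list of integer ids.

    No parsing of the PDF structure is done: each byte becomes its own
    value (0-255). Longer inputs are truncated, shorter ones are padded.
    """
    ids = list(data[:max_len])
    ids.extend([PAD_ID] * (max_len - len(ids)))
    return ids


def ensemble_score(models: Sequence[Model], ids: Sequence[int]) -> float:
    """Average the per-model scores (an assumed combination rule)."""
    if not models:
        raise ValueError("need at least one model")
    return sum(model(ids) for model in models) / len(models)


def is_malicious(models: Sequence[Model], ids: Sequence[int],
                 threshold: float = 0.5) -> bool:
    """Flag the file when the averaged score reaches the assumed threshold."""
    return ensemble_score(models, ids) >= threshold


# Stand-in models returning fixed scores, in place of trained networks.
stub_models: List[Model] = [lambda ids: 0.9, lambda ids: 0.7, lambda ids: 0.2]

sample_ids = bytes_to_ids(b"%PDF-1.4")
verdict = is_malicious(stub_models, sample_ids)
```

In a real pipeline, each stand-in would be replaced by a trained network that consumes the id sequence (typically through an embedding layer followed by 1D convolutions), and the combination rule would be whatever the ensemble was validated with.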
