Avast-CTU Public CAPE Dataset
This work provides a new dataset for cybersecurity researchers developing machine learning methods for malware detection. The contribution is incremental: it focuses on data availability rather than novel methods.
The authors addressed the lack of publicly available malware datasets from dynamic sandboxes by releasing the Avast-CTU Public CAPE Dataset, which includes execution logs to support research in malware detection and analysis.
There is a limited amount of publicly available data to support research in malware analysis technology. In particular, there are virtually no publicly available datasets generated from rich sandboxes such as Cuckoo/CAPE. The benefit of dynamic sandboxes is that they realistically simulate file execution on the target machine and produce a log of that execution. Because the machine can be infected by malware, there is a good chance of capturing the malicious behavior in the execution logs, which allows researchers to study such behavior in detail. Although the subsequent analysis of log information is extensively covered in industrial cybersecurity backends, to our knowledge only limited effort has been invested in academia to advance such log analysis capabilities using cutting-edge techniques. We make this sample dataset available to support the design of new machine learning methods for malware detection, especially for automatic detection of generic malicious behavior. The dataset was collected in cooperation between Avast Software and Czech Technical University - AI Center (AIC).