CR AI CL | Feb 16, 2025

Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training

Yao-Ching Yu, Tsun-Han Chiang, Cheng-Wei Tsai, Chien-Ming Huang, Wen-Kwang Tsao

arXiv:2502.11191v323.623 citations | h-index: 2 | Has Code | EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses a data bottleneck for cybersecurity researchers and practitioners and enables better LLM training, though the contribution is incremental in that it applies existing methods to new data.

The paper tackles the lack of open-source datasets for cybersecurity LLM training by presenting a comprehensive suite covering pretraining, instruction fine-tuning, and reasoning distillation. Continual pre-training on the suite yields a 15.9% improvement in the aggregate score, and reasoning distillation yields a 15.8% gain on the CISSP security certification benchmark.

Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. However, in cybersecurity, we have noticed a lack of open-source datasets, with a particular lack of high-quality cybersecurity pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pre-training on our dataset yields a 15.9% improvement in the aggregate score, while reasoning distillation leads to a 15.8% gain in security certification (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community. For access to all datasets and model weights, please refer to https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243.
