Xiaochun Yun

2papers

2 Papers

CLAug 13, 2024Code
CTISum: A New Benchmark Dataset For Cyber Threat Intelligence Summarization

Wei Peng, Junmei Ding, Wei Wang et al.

Cyber Threat Intelligence (CTI) summarization involves generating concise and accurate highlights from web intelligence data, which is critical for providing decision-makers with actionable insights to swiftly detect and respond to cyber threats in the cybersecurity domain. Despite that, the development of efficient techniques for summarizing CTI reports, comprising facts, analytical insights, attack processes, and more, has been hindered by the lack of suitable datasets. To address this gap, we introduce CTISum, a new benchmark dataset designed for the CTI summarization task. Recognizing the significance of understanding attack processes, we also propose a novel fine-grained subtask: attack process summarization, which aims to help defenders assess risks, identify security gaps, and uncover vulnerabilities. Specifically, a multi-stage annotation pipeline is designed to collect and annotate CTI data from diverse web sources, alongside a comprehensive benchmarking of CTISum using both extractive, abstractive and LLMs-based summarization methods. Experimental results reveal that current state-of-the-art models face significant challenges when applied to CTISum, highlighting that automatic summarization of CTI reports remains an open research problem. The code and example dataset can be made publicly available at https://github.com/pengwei-iie/CTISum.

CRJul 29, 2024
Efficiently and Effectively: A Two-stage Approach to Balance Plaintext and Encrypted Text for Traffic Classification

Wei Peng, Lei Cui, Wei Cai et al.

Encrypted traffic classification is the task of identifying the application or service associated with encrypted network traffic. One effective approach for this task is to use deep learning methods to encode the raw traffic bytes directly and automatically extract features for classification (byte-based models). However, current byte-based models input raw traffic bytes, whether plaintext or encrypted text, for automated feature extraction, neglecting the distinct impacts of plaintext and encrypted text on downstream tasks. Additionally, these models primarily focus on improving classification accuracy, with little emphasis on the efficiency of models. In this paper, for the first time, we analyze the impact of plaintext and encrypted text on the model's effectiveness and efficiency. Based on our observations and findings, we propose a two-phase approach to balance the trade-off between plaintext and encrypted text in traffic classification. Specifically, Stage one is to Determine whether the Plain text is enough to be accurately Classified (DPC) using the proposed DPC Selector. This stage quickly identifies samples that can be classified using plaintext, leveraging explicit byte features in plaintext to enhance model's efficiency. Stage two aims to adaptively make a classification with the result from stage one. This stage incorporates encrypted text information for samples that cannot be classified using plaintext alone, ensuring the model's effectiveness on traffic classification tasks. Experiments on two datasets demonstrate that our proposed model achieves state-of-the-art results in both effectiveness and efficiency.