Zhiyu Hao

3papers

10citations

Novelty37%

AI Score39

Ranked #105,931 of 201,326 authors (top 53%)#19,280 in CL (top 59%)

3 Papers

CLAug 13, 2024Code

CTISum: A New Benchmark Dataset For Cyber Threat Intelligence Summarization

Wei Peng, Junmei Ding, Wei Wang et al.

Cyber Threat Intelligence (CTI) summarization involves generating concise and accurate highlights from web intelligence data, which is critical for providing decision-makers with actionable insights to swiftly detect and respond to cyber threats in the cybersecurity domain. Despite that, the development of efficient techniques for summarizing CTI reports, comprising facts, analytical insights, attack processes, and more, has been hindered by the lack of suitable datasets. To address this gap, we introduce CTISum, a new benchmark dataset designed for the CTI summarization task. Recognizing the significance of understanding attack processes, we also propose a novel fine-grained subtask: attack process summarization, which aims to help defenders assess risks, identify security gaps, and uncover vulnerabilities. Specifically, a multi-stage annotation pipeline is designed to collect and annotate CTI data from diverse web sources, alongside a comprehensive benchmarking of CTISum using both extractive, abstractive and LLMs-based summarization methods. Experimental results reveal that current state-of-the-art models face significant challenges when applied to CTISum, highlighting that automatic summarization of CTI reports remains an open research problem. The code and example dataset can be made publicly available at https://github.com/pengwei-iie/CTISum.

48.0LGMay 12

More Than Meets the Eye: A Semantics-Aware Traffic Augmentation Framework for Generalizable Website Fingerprinting

Youquan Xian, Xueying Zeng, Lingjia Meng et al.

Deep learning-based website fingerprinting has emerged as an effective technique for inferring the websites users visit. Although existing methods achieve strong performance on closed-world datasets, they often fail to generalize to real-world environments, especially under geographic and temporal shifts. This limitation fundamentally stems from the coupled effects of two key challenges: application-layer resource composition variability and observable feature instability induced by cross-layer encapsulation. Intertwined, these factors induce systematic shifts between underlying application semantics and observable traffic features. To address the above challenges, we propose SATA , a semantics-aware traffic augmentation framework. Specifically, SATA first performs application-layer semantic augmentation based on protocol rules, expanding the resource composition patterns within each flow and frame sequence patterns under protocol constraints. Based on these augmented frame sequences, we further introduce a cross-layer feature alignment mechanism via knowledge distillation. It aligns frame sequence with packet-length sequence features, enabling cross-layer feature alignment between enhanced semantics and observable sequences. Extensive experiments show that SATA successfully generates traffic patterns that are absent from the training set but genuinely exist in the test set, and significantly improves the performance of mainstream models across diverse and complex scenarios. In particular, in open-world settings, SATA improves ACC by 90.81% and AUROC by 48.37%. The source code of the prototype system is available at https://anonymous.4open.science/r/SATA-B6C2/.

CRJul 29, 2024

Efficiently and Effectively: A Two-stage Approach to Balance Plaintext and Encrypted Text for Traffic Classification

Wei Peng, Lei Cui, Wei Cai et al.

Encrypted traffic classification is the task of identifying the application or service associated with encrypted network traffic. One effective approach for this task is to use deep learning methods to encode the raw traffic bytes directly and automatically extract features for classification (byte-based models). However, current byte-based models input raw traffic bytes, whether plaintext or encrypted text, for automated feature extraction, neglecting the distinct impacts of plaintext and encrypted text on downstream tasks. Additionally, these models primarily focus on improving classification accuracy, with little emphasis on the efficiency of models. In this paper, for the first time, we analyze the impact of plaintext and encrypted text on the model's effectiveness and efficiency. Based on our observations and findings, we propose a two-phase approach to balance the trade-off between plaintext and encrypted text in traffic classification. Specifically, Stage one is to Determine whether the Plain text is enough to be accurately Classified (DPC) using the proposed DPC Selector. This stage quickly identifies samples that can be classified using plaintext, leveraging explicit byte features in plaintext to enhance model's efficiency. Stage two aims to adaptively make a classification with the result from stage one. This stage incorporates encrypted text information for samples that cannot be classified using plaintext alone, ensuring the model's effectiveness on traffic classification tasks. Experiments on two datasets demonstrate that our proposed model achieves state-of-the-art results in both effectiveness and efficiency.