CRJul 7, 2018Code
SmartSeed: Smart Seed Generation for Efficient FuzzingChenyang Lyu, Shouling Ji, Yuwei Li et al.
Fuzzing is an automated application vulnerability detection method. For genetic algorithm-based fuzzing, it can mutate the seed files provided by users to obtain a number of inputs, which are then used to test the objective application in order to trigger potential crashes. As shown in existing literature, the seed file selection is crucial for the efficiency of fuzzing. However, current seed selection strategies do not seem to be better than randomly picking seed files. Therefore, in this paper, we propose a novel and generic system, named SmartSeed, to generate seed files towards efficient fuzzing. Specifically, SmartSeed is designed based on a machine learning model to learn and generate high-value binary seeds. We evaluate SmartSeed along with American Fuzzy Lop (AFL) on 12 open-source applications with the input formats of mp3, bmp or flv. We also combine SmartSeed with different fuzzing tools to examine its compatibility. From extensive experiments, we find that SmartSeed has the following advantages: First, it only requires tens of seconds to generate sufficient high-value seeds. Second, it can generate seeds with multiple kinds of input formats and significantly improves the fuzzing performance for most applications with the same input format. Third, SmartSeed is compatible to different fuzzing tools. In total, our system discovers more than twice unique crashes and 5,040 extra unique paths than the existing best seed selection strategy for the evaluated 12 applications. From the crashes found by SmartSeed, we discover 16 new vulnerabilities and have received their CVE IDs.
LGJul 4, 2025
Reinforcement Learning-based Feature Generation Algorithm for Scientific DataMeng Xiao, Junfeng Zhou, Yuanchun Zhou
Feature generation (FG) aims to enhance the prediction potential of original data by constructing high-order feature combinations and removing redundant features. It is a key preprocessing step for tabular scientific data to improve downstream machine-learning model performance. Traditional methods face the following two challenges when dealing with the feature generation of scientific data: First, the effective construction of high-order feature combinations in scientific data necessitates profound and extensive domain-specific expertise. Secondly, as the order of feature combinations increases, the search space expands exponentially, imposing prohibitive human labor consumption. Advancements in the Data-Centric Artificial Intelligence (DCAI) paradigm have opened novel avenues for automating feature generation processes. Inspired by that, this paper revisits the conventional feature generation workflow and proposes the Multi-agent Feature Generation (MAFG) framework. Specifically, in the iterative exploration stage, multi-agents will construct mathematical transformation equations collaboratively, synthesize and identify feature combinations ex-hibiting high information content, and leverage a reinforcement learning mechanism to evolve their strategies. Upon completing the exploration phase, MAFG integrates the large language models (LLMs) to interpreta-tively evaluate the generated features of each significant model performance breakthrough. Experimental results and case studies consistently demonstrate that the MAFG framework effectively automates the feature generation process and significantly enhances various downstream scientific data mining tasks.
CLFeb 8, 2024
LightCAM: A Fast and Light Implementation of Context-Aware Masking based D-TDNN for Speaker VerificationDi Cao, Xianchen Wang, Junfeng Zhou et al.
Traditional Time Delay Neural Networks (TDNN) have achieved state-of-the-art performance at the cost of high computational complexity and slower inference speed, making them difficult to implement in an industrial environment. The Densely Connected Time Delay Neural Network (D-TDNN) with Context Aware Masking (CAM) module has proven to be an efficient structure to reduce complexity while maintaining system performance. In this paper, we propose a fast and lightweight model, LightCAM, which further adopts a depthwise separable convolution module (DSM) and uses multi-scale feature aggregation (MFA) for feature fusion at different levels. Extensive experiments are conducted on VoxCeleb dataset, the comparative results show that it has achieved an EER of 0.83 and MinDCF of 0.0891 in VoxCeleb1-O, which outperforms the other mainstream speaker verification methods. In addition, complexity analysis further demonstrates that the proposed architecture has lower computational cost and faster inference speed.
DBOct 2, 2021
Tao: A Learning Framework for Adaptive Nearest Neighbor Search using Static Features OnlyKaixiang Yang, Hongya Wang, Bo Xu et al.
Approximate nearest neighbor (ANN) search is a fundamental problem in areas such as data management,information retrieval and machine learning. Recently, Li et al. proposed a learned approach named AdaptNN to support adaptive ANN query processing. In the middle of query execution, AdaptNN collects a number of runtime features and predicts termination condition for each individual query, by which better end-to-end latency is attained. Despite its efficiency, using runtime features complicates the learning process and leads to performance degradation. Radically different from AdaptNN, we argue that it is promising to predict termination condition before query exetution. Particularly, we developed Tao, a general learning framework for Terminating ANN queries Adaptively using Only static features. Upon the arrival of a query, Tao first maps the query to a local intrinsic dimension (LID) number, and then predicts the termination condition using LID instead of runtime features. By decoupling prediction procedure from query execution, Tao eliminates the laborious feature selection process involved in AdaptNN. Besides, two design principles are formulated to guide the application of Tao and improve the explainability of the prediction model. We integrate two state-of-the-art indexing approaches, i.e., IMI and HNSW, into Tao, and evaluate the performance over several million to billion-scale datasets. Experimental results show that, in addition to its simplicity and generality , Tao achieves up to 2.69x speedup even compared to its counterpart, at the same high accuracy targets.