LGJul 25, 2023Code
RoSAS: Deep Semi-Supervised Anomaly Detection with Contamination-Resilient Continuous SupervisionHongzuo Xu, Yijie Wang, Guansong Pang et al.
Semi-supervised anomaly detection methods leverage a few anomaly examples to yield drastically improved performance compared to unsupervised models. However, they still suffer from two limitations: 1) unlabeled anomalies (i.e., anomaly contamination) may mislead the learning process when all the unlabeled data are employed as inliers for model training; 2) only discrete supervision information (such as binary or ordinal data labels) is exploited, which leads to suboptimal learning of anomaly scores that essentially take on a continuous distribution. Therefore, this paper proposes a novel semi-supervised anomaly detection method, which devises \textit{contamination-resilient continuous supervisory signals}. Specifically, we propose a mass interpolation method to diffuse the abnormality of labeled anomalies, thereby creating new data samples labeled with continuous abnormal degrees. Meanwhile, the contaminated area can be covered by new data samples generated via combinations of data with correct labels. A feature learning-based objective is added to serve as an optimization constraint to regularize the network and further enhance the robustness w.r.t. anomaly contamination. Extensive experiments on 11 real-world datasets show that our approach significantly outperforms state-of-the-art competitors by 20%-30% in AUC-PR and obtains more robust and superior performance in settings with different anomaly contamination levels and varying numbers of labeled anomalies. The source code is available at https://github.com/xuhongzuo/rosas/.
LGJun 14, 2022
Deep Isolation Forest for Anomaly DetectionHongzuo Xu, Guansong Pang, Yijie Wang et al.
Isolation forest (iForest) has been emerging as arguably the most popular anomaly detector in recent years due to its general effectiveness across different benchmarks and strong scalability. Nevertheless, its linear axis-parallel isolation method often leads to (i) failure in detecting hard anomalies that are difficult to isolate in high-dimensional/non-linear-separable data space, and (ii) notorious algorithmic bias that assigns unexpectedly lower anomaly scores to artefact regions. These issues contribute to high false negative errors. Several iForest extensions are introduced, but they essentially still employ shallow, linear data partition, restricting their power in isolating true anomalies. Therefore, this paper proposes deep isolation forest. We introduce a new representation scheme that utilises casually initialised neural networks to map original data into random representation ensembles, where random axis-parallel cuts are subsequently applied to perform the data partition. This representation scheme facilitates high freedom of the partition in the original data space (equivalent to non-linear partition on subspaces of varying sizes), encouraging a unique synergy between random representations and random partition-based isolation. Extensive experiments show that our model achieves significant improvement over state-of-the-art isolation-based methods and deep detectors on tabular, graph and time series datasets; our model also inherits desired scalability from iForest.
LGJul 25, 2022
Calibrated One-class Classification for Unsupervised Time Series Anomaly DetectionHongzuo Xu, Yijie Wang, Songlei Jian et al.
Time series anomaly detection is instrumental in maintaining system availability in various domains. Current work in this research line mainly focuses on learning data normality deeply and comprehensively by devising advanced neural network structures and new reconstruction/prediction learning objectives. However, their one-class learning process can be misled by latent anomalies in training data (i.e., anomaly contamination) under the unsupervised paradigm. Their learning process also lacks knowledge about the anomalies. Consequently, they often learn a biased, inaccurate normality boundary. To tackle these problems, this paper proposes calibrated one-class classification for anomaly detection, realizing contamination-tolerant, anomaly-informed learning of data normality via uncertainty modeling-based calibration and native anomaly-based calibration. Specifically, our approach adaptively penalizes uncertain predictions to restrain irregular samples in anomaly contamination during optimization, while simultaneously encouraging confident predictions on regular samples to ensure effective normality learning. This largely alleviates the negative impact of anomaly contamination. Our approach also creates native anomaly examples via perturbation to simulate time series abnormal behaviors. Through discriminating these dummy anomalies, our one-class learning is further calibrated to form a more precise normality boundary. Extensive experiments on ten real-world datasets show that our model achieves substantial improvement over sixteen state-of-the-art contenders.
73.8DBMar 23
FuzzySQL: Uncovering Hidden Vulnerabilities in DBMS Special Features with LLM-Driven FuzzingYongxin Chen, Zhiyuan Jiang, Chao Zhang et al.
Traditional database fuzzing techniques primarily focus on syntactic correctness and general SQL structures, leaving critical yet obscure DBMS features, such as system-level modes (e.g., GTID), programmatic constructs (e.g., PROCEDURE), advanced process commands (e.g., KILL), largely underexplored. Although rarely triggered by typical inputs, these features can lead to severe crashes or security issues when executed under edge-case conditions. In this paper, we present FuzzySQL, a novel LLM-powered adaptive fuzzing framework designed to uncover subtle vulnerabilities in DBMS special features. FuzzySQL combines grammar-guided SQL generation with logic-shifting progressive mutation, a novel technique that explores alternative control paths by negating conditions and restructuring execution logic, synthesizing structurally and semantically diverse test cases. To further ensure deeper execution coverage of the back end, FuzzySQL employs a hybrid error repair pipeline that unifies rule-based patching with LLM-driven semantic repair, enabling automatic correction of syntactic and context-sensitive failures. We evaluate FuzzySQL across multiple DBMSs, including MySQL, MariaDB, SQLite, PostgreSQL and Clickhouse, uncovering 64 vulnerabilities, 27 of which are tied to under-tested DBMS special features. As of this writing, 60 cases have been confirmed with 9 assigned CVE identifiers, 31 already fixed by vendors, and additional vulnerabilities scheduled to be patched in upcoming releases. Our results highlight the limitations of conventional fuzzers in semantic feature coverage and demonstrate the potential of LLM-based fuzzing to discover deeply hidden bugs in complex database systems.
77.5CGMay 15
An improved boundary-focused adaptive quadtree algorithm for circle-polygon intersection area approximationZeping Yi, Yongjun Wang, Baoshan Wang et al.
In this paper, we present an improved numerical algorithm for computing the intersection area of multiple circles and a complex polygon efficiently. This geometric problem is fundamental to applications such as wireless sensor networks and base station deployment. The key idea is a curvature-multiplicity-guided adaptive sampling strategy that dynamically concentrates sampling points in geometrically complex boundary regions. The algorithm integrates three components: (i) adaptive quadtree partitioning, (ii) analytical integration via Green's theorem for cells intersecting a single circle, and (iii) curvature-multiplicity-guided Monte Carlo subsampling for cells intersecting multiple circles, where a minimum sample count and a constant factor are introduced into the sampling size. Theoretical analysis shows that the algorithm achieves O(1/ε3/2) computational complexity while maintaining an O(ε) error bound, improving upon the O(1/ε2) complexity of classical Monte Carlo and uniform grid methods for the same error tolerance ε. Numerical experiments on complex polygons, including synthetic data and real-world scenarios, demonstrate that our algorithm outperforms five classical methods in terms of relative error. Furthermore, parameter sensitivity analysis confirms that the algorithm is robust and could make it suited for practical applications such as wireless sensor network coverage estimation.
CVMay 1, 2024Code
F2M-Reg: Unsupervised RGB-D Point Cloud Registration with Frame-to-Model OptimizationZhinan Yu, Zheng Qin, Yijie Tang et al.
This work studies the problem of unsupervised RGB-D point cloud registration, which aims at training a robust registration model without ground-truth pose supervision. Existing methods usually leverages unposed RGB-D sequences and adopt a frame-to-frame framework based on differentiable rendering to train the registration model, which enforces the photometric and geometric consistency between the two frames for supervision. However, this frame-to-frame framework is vulnerable to inconsistent factors between different frames, e.g., lighting changes, geometry occlusion, and reflective materials, which leads to suboptimal convergence of the registration model. In this paper, we propose a novel frame-to-model optimization framework named F2M-Reg for unsupervised RGB-D point cloud registration. We leverage the neural implicit field as a global model of the scene and optimize the estimated poses of the frames by registering them to the global model, and the registration model is subsequently trained with the optimized poses. Thanks to the global encoding capability of neural implicit field, our frame-to-model framework is significantly more robust to inconsistent factors between different frames and thus can provide better supervision for the registration model. Besides, we demonstrate that F2M-Reg can be further enhanced by a simplistic synthetic warming-up strategy. To this end, we construct a photorealistic synthetic dataset named Sim-RGBD to initialize the registration model for the frame-to-model optimization on real-world RGB-D sequences. Extensive experiments on four challenging benchmarks have shown that our method surpasses the previous state-of-the-art counterparts by a large margin, especially under scenarios with severe lighting changes and low overlap. Our code and models are available at https://github.com/MrIsland/F2M_Reg.
CVFeb 3, 2025
Simultaneous Automatic Picking and Manual Picking Refinement for First-BreakHaowen Bai, Zixiang Zhao, Jiangshe Zhang et al.
First-break picking is a pivotal procedure in processing microseismic data for geophysics and resource exploration. Recent advancements in deep learning have catalyzed the evolution of automated methods for identifying first-break. Nevertheless, the complexity of seismic data acquisition and the requirement for detailed, expert-driven labeling often result in outliers and potential mislabeling within manually labeled datasets. These issues can negatively affect the training of neural networks, necessitating algorithms that handle outliers or mislabeled data effectively. We introduce the Simultaneous Picking and Refinement (SPR) algorithm, designed to handle datasets plagued by outlier samples or even noisy labels. Unlike conventional approaches that regard manual picks as ground truth, our method treats the true first-break as a latent variable within a probabilistic model that includes a first-break labeling prior. SPR aims to uncover this variable, enabling dynamic adjustments and improved accuracy across the dataset. This strategy mitigates the impact of outliers or inaccuracies in manual labels. Intra-site picking experiments and cross-site generalization experiments on publicly available data confirm our method's performance in identifying first-break and its generalization across different sites. Additionally, our investigations into noisy signals and labels underscore SPR's resilience to both types of noise and its capability to refine misaligned manual annotations. Moreover, the flexibility of SPR, not being limited to any single network architecture, enhances its adaptability across various deep learning-based picking methods. Focusing on learning from data that may contain outliers or partial inaccuracies, SPR provides a robust solution to some of the principal obstacles in automatic first-break picking.
ROFeb 21, 2024
Learning Dual-arm Object Rearrangement for Cartesian RobotsShishun Zhang, Qijin She, Wenhao Li et al.
This work focuses on the dual-arm object rearrangement problem abstracted from a realistic industrial scenario of Cartesian robots. The goal of this problem is to transfer all the objects from sources to targets with the minimum total completion time. To achieve the goal, the core idea is to develop an effective object-to-arm task assignment strategy for minimizing the cumulative task execution time and maximizing the dual-arm cooperation efficiency. One of the difficulties in the task assignment is the scalability problem. As the number of objects increases, the computation time of traditional offline-search-based methods grows strongly for computational complexity. Encouraged by the adaptability of reinforcement learning (RL) in long-sequence task decisions, we propose an online task assignment decision method based on RL, and the computation time of our method only increases linearly with the number of objects. Further, we design an attention-based network to model the dependencies between the input states during the whole task execution process to help find the most reasonable object-to-arm correspondence in each task assignment round. In the experimental part, we adapt some search-based methods to this specific setting and compare our method with them. Experimental result shows that our approach achieves outperformance over search-based methods in total execution time and computational efficiency, and also verifies the generalization of our method to different numbers of objects. In addition, we show the effectiveness of our method deployed on the real robot in the supplementary video.
LGApr 30, 2021
DRAM Failure Prediction in AIOps: Empirical Evaluation, Challenges and OpportunitiesZhiyue Wu, Hongzuo Xu, Guansong Pang et al.
DRAM failure prediction is a vital task in AIOps, which is crucial to maintain the reliability and sustainable service of large-scale data centers. However, limited work has been done on DRAM failure prediction mainly due to the lack of public available datasets. This paper presents a comprehensive empirical evaluation of diverse machine learning techniques for DRAM failure prediction using a large-scale multi-source dataset, including more than three millions of records of kernel, address, and mcelog data, provided by Alibaba Cloud through PAKDD 2021 competition. Particularly, we first formulate the problem as a multi-class classification task and exhaustively evaluate seven popular/state-of-the-art classifiers on both the individual and multiple data sources. We then formulate the problem as an unsupervised anomaly detection task and evaluate three state-of-the-art anomaly detectors. Further, based on the empirical results and our experience of attending this competition, we discuss major challenges and present future research opportunities in this task.
CRJan 25, 2021
Few-Shot Website Fingerprinting AttackMantun Chen, Yongjun Wang, Zhiquan Qin et al.
This work introduces a novel data augmentation method for few-shot website fingerprinting (WF) attack where only a handful of training samples per website are available for deep learning model optimization. Moving beyond earlier WF methods relying on manually-engineered feature representations, more advanced deep learning alternatives demonstrate that learning feature representations automatically from training data is superior. Nonetheless, this advantage is subject to an unrealistic assumption that there exist many training samples per website, which otherwise will disappear. To address this, we introduce a model-agnostic, efficient, and Harmonious Data Augmentation (HDA) method that can improve deep WF attacking methods significantly. HDA involves both intra-sample and inter-sample data transformations that can be used in harmonious manner to expand a tiny training dataset to an arbitrarily large collection, therefore effectively and explicitly addressing the intrinsic data scarcity problem. We conducted expensive experiments to validate our HDA for boosting state-of-the-art deep learning WF attack models in both closed-world and open-world attacking scenarios, at absence and presence of strong defense. {For instance, in the more challenging and realistic evaluation scenario with WTF-PAD based defense, our HDA method surpasses the previous state-of-the-art results by more than 4% in absolute classification accuracy in the 20-shot learning case.
SRApr 9, 2015
Linearly Supporting Feature Extraction For Automated Estimation Of Stellar Atmospheric ParametersXiangru Li, Yu Lu, Georges Comte et al.
We describe a scheme to extract linearly supporting (LSU) features from stellar spectra to automatically estimate the atmospheric parameters $T_{eff}$, log$~g$, and [Fe/H]. "Linearly supporting" means that the atmospheric parameters can be accurately estimated from the extracted features through a linear model. The successive steps of the process are as follow: first, decompose the spectrum using a wavelet packet (WP) and represent it by the derived decomposition coefficients; second, detect representative spectral features from the decomposition coefficients using the proposed method Least Absolute Shrinkage and Selection Operator (LARS)$_{bs}$; third, estimate the atmospheric parameters $T_{eff}$, log$~g$, and [Fe/H] from the detected features using a linear regression method. One prominent characteristic of this scheme is its ability to evaluate quantitatively the contribution of each detected feature to the atmospheric parameter estimate and also to trace back the physical significance of that feature. This work also shows that the usefulness of a component depends on both wavelength and frequency. The proposed scheme has been evaluated on both real spectra from the Sloan Digital Sky Survey (SDSS)/SEGUE and synthetic spectra calculated from Kurucz's NEWODF models. On real spectra, we extracted 23 features to estimate $T_{eff}$, 62 features for log$~g$, and 68 features for [Fe/H]. Test consistencies between our estimates and those provided by the Spectroscopic Sarameter Pipeline of SDSS show that the mean absolute errors (MAEs) are 0.0062 dex for log$~T_{eff}$ (83 K for $T_{eff}$), 0.2345 dex for log$~g$, and 0.1564 dex for [Fe/H]. For the synthetic spectra, the MAE test accuracies are 0.0022 dex for log$~T_{eff}$ (32 K for $T_{eff}$), 0.0337 dex for log$~g$, and 0.0268 dex for [Fe/H].