CVAug 29, 2023Code
AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language ModelsZhaopeng Gu, Bingke Zhu, Guibo Zhu et al.
Large Vision-Language Models (LVLMs) such as MiniGPT-4 and LLaVA have demonstrated the capability of understanding images and achieved remarkable performance in various visual tasks. Despite their strong abilities in recognizing common objects due to extensive training datasets, they lack specific domain knowledge and have a weaker understanding of localized details within objects, which hinders their effectiveness in the Industrial Anomaly Detection (IAD) task. On the other hand, most existing IAD methods only provide anomaly scores and necessitate the manual setting of thresholds to distinguish between normal and abnormal samples, which restricts their practical implementation. In this paper, we explore the utilization of LVLM to address the IAD problem and propose AnomalyGPT, a novel IAD approach based on LVLM. We generate training data by simulating anomalous images and producing corresponding textual descriptions for each image. We also employ an image decoder to provide fine-grained semantic and design a prompt learner to fine-tune the LVLM using prompt embeddings. Our AnomalyGPT eliminates the need for manual threshold adjustments, thus directly assesses the presence and locations of anomalies. Additionally, AnomalyGPT supports multi-turn dialogues and exhibits impressive few-shot in-context learning capabilities. With only one normal shot, AnomalyGPT achieves the state-of-the-art performance with an accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3% on the MVTec-AD dataset. Code is available at https://github.com/CASIA-IVA-Lab/AnomalyGPT.
CVJul 31, 2023Code
Benchmarking and Analyzing Robust Point Cloud Recognition: Bag of Tricks for Defending Adversarial ExamplesQiufan Ji, Lin Wang, Cong Shi et al.
Deep Neural Networks (DNNs) for 3D point cloud recognition are vulnerable to adversarial examples, threatening their practical deployment. Despite the many research endeavors have been made to tackle this issue in recent years, the diversity of adversarial examples on 3D point clouds makes them more challenging to defend against than those on 2D images. For examples, attackers can generate adversarial examples by adding, shifting, or removing points. Consequently, existing defense strategies are hard to counter unseen point cloud adversarial examples. In this paper, we first establish a comprehensive, and rigorous point cloud adversarial robustness benchmark to evaluate adversarial robustness, which can provide a detailed understanding of the effects of the defense and attack methods. We then collect existing defense tricks in point cloud adversarial defenses and then perform extensive and systematic experiments to identify an effective combination of these tricks. Furthermore, we propose a hybrid training augmentation methods that consider various types of point cloud adversarial examples to adversarial training, significantly improving the adversarial robustness. By combining these tricks, we construct a more robust defense framework achieving an average accuracy of 83.45\% against various attacks, demonstrating its capability to enabling robust learners. Our codebase are open-sourced on: \url{https://github.com/qiufan319/benchmark_pc_attack.git}.
CRAug 22, 2022Code
RIBAC: Towards Robust and Imperceptible Backdoor Attack against Compact DNNHuy Phan, Cong Shi, Yi Xie et al.
Recently backdoor attack has become an emerging threat to the security of deep neural network (DNN) models. To date, most of the existing studies focus on backdoor attack against the uncompressed model; while the vulnerability of compressed DNNs, which are widely used in the practical applications, is little exploited yet. In this paper, we propose to study and develop Robust and Imperceptible Backdoor Attack against Compact DNN models (RIBAC). By performing systematic analysis and exploration on the important design knobs, we propose a framework that can learn the proper trigger patterns, model parameters and pruning masks in an efficient way. Thereby achieving high trigger stealthiness, high attack success rate and high model efficiency simultaneously. Extensive evaluations across different datasets, including the test against the state-of-the-art defense mechanisms, demonstrate the high robustness, stealthiness and model efficiency of RIBAC. Code is available at https://github.com/huyvnphan/ECCV2022-RIBAC
CVMar 14, 2022
UniVIP: A Unified Framework for Self-Supervised Visual Pre-trainingZhaowen Li, Yousong Zhu, Fan Yang et al.
Self-supervised learning (SSL) holds promise in leveraging large amounts of unlabeled data. However, the success of popular SSL methods has limited on single-centric-object images like those in ImageNet and ignores the correlation among the scene and instances, as well as the semantic difference of instances in the scene. To address the above problems, we propose a Unified Self-supervised Visual Pre-training (UniVIP), a novel self-supervised framework to learn versatile visual representations on either single-centric-object or non-iconic dataset. The framework takes into account the representation learning at three levels: 1) the similarity of scene-scene, 2) the correlation of scene-instance, 3) the discrimination of instance-instance. During the learning, we adopt the optimal transport algorithm to automatically measure the discrimination of instances. Massive experiments show that UniVIP pre-trained on non-iconic COCO achieves state-of-the-art transfer performance on a variety of downstream tasks, such as image classification, semi-supervised learning, object detection and segmentation. Furthermore, our method can also exploit single-centric-object dataset such as ImageNet and outperforms BYOL by 2.5% with the same pre-training epochs in linear probing, and surpass current self-supervised object detection methods on COCO dataset, demonstrating its universality and potential.
CVSep 20, 2022
View-Disentangled Transformer for Brain Lesion DetectionHaofeng Li, Junjia Huang, Guanbin Li et al.
Deep neural networks (DNNs) have been widely adopted in brain lesion detection and segmentation. However, locating small lesions in 2D MRI slices is challenging, and requires to balance between the granularity of 3D context aggregation and the computational complexity. In this paper, we propose a novel view-disentangled transformer to enhance the extraction of MRI features for more accurate tumour detection. First, the proposed transformer harvests long-range correlation among different positions in a 3D brain scan. Second, the transformer models a stack of slice features as multiple 2D views and enhance these features view-by-view, which approximately achieves the 3D correlation computing in an efficient way. Third, we deploy the proposed transformer module in a transformer backbone, which can effectively detect the 2D regions surrounding brain lesions. The experimental results show that our proposed view-disentangled transformer performs well for brain lesion detection on a challenging brain MRI dataset.
CVSep 23, 2024
The BRAVO Semantic Segmentation Challenge Results in UNCV2024Tuan-Hung Vu, Eduardo Valle, Andrei Bursuc et al.
We propose the unified BRAVO challenge to benchmark the reliability of semantic segmentation models under realistic perturbations and unknown out-of-distribution (OOD) scenarios. We define two categories of reliability: (1) semantic reliability, which reflects the model's accuracy and calibration when exposed to various perturbations; and (2) OOD reliability, which measures the model's ability to detect object classes that are unknown during training. The challenge attracted nearly 100 submissions from international teams representing notable research institutions. The results reveal interesting insights into the importance of large-scale pre-training and minimal architectural design in developing robust and reliable semantic segmentation models.
SDNov 10, 2022
Privacy-Utility Balanced Voice De-Identification Using Adversarial ExamplesMeng Chen, Li Lu, Jiadi Yu et al.
Faced with the threat of identity leakage during voice data publishing, users are engaged in a privacy-utility dilemma when enjoying convenient voice services. Existing studies employ direct modification or text-based re-synthesis to de-identify users' voices, but resulting in inconsistent audibility in the presence of human participants. In this paper, we propose a voice de-identification system, which uses adversarial examples to balance the privacy and utility of voice services. Instead of typical additive examples inducing perceivable distortions, we design a novel convolutional adversarial example that modulates perturbations into real-world room impulse responses. Benefit from this, our system could preserve user identity from exposure by Automatic Speaker Identification (ASI) while remaining the voice perceptual quality for non-intrusive de-identification. Moreover, our system learns a compact speaker distribution through a conditional variational auto-encoder to sample diverse target embeddings on demand. Combining diverse target generation and input-specific perturbation construction, our system enables any-to-any identify transformation for adaptive de-identification. Experimental results show that our system could achieve 98% and 79% successful de-identification on mainstream ASIs and commercial systems with an objective Mel cepstral distortion of 4.31dB and a subjective mean opinion score of 4.48.
CVApr 21, 2024Code
FiLo: Zero-Shot Anomaly Detection by Fine-Grained Description and High-Quality LocalizationZhaopeng Gu, Bingke Zhu, Guibo Zhu et al.
Zero-shot anomaly detection (ZSAD) methods entail detecting anomalies directly without access to any known normal or abnormal samples within the target item categories. Existing approaches typically rely on the robust generalization capabilities of multimodal pretrained models, computing similarities between manually crafted textual features representing "normal" or "abnormal" semantics and image features to detect anomalies and localize anomalous patches. However, the generic descriptions of "abnormal" often fail to precisely match diverse types of anomalies across different object categories. Additionally, computing feature similarities for single patches struggles to pinpoint specific locations of anomalies with various sizes and scales. To address these issues, we propose a novel ZSAD method called FiLo, comprising two components: adaptively learned Fine-Grained Description (FG-Des) and position-enhanced High-Quality Localization (HQ-Loc). FG-Des introduces fine-grained anomaly descriptions for each category using Large Language Models (LLMs) and employs adaptively learned textual templates to enhance the accuracy and interpretability of anomaly detection. HQ-Loc, utilizing Grounding DINO for preliminary localization, position-enhanced text prompts, and Multi-scale Multi-shape Cross-modal Interaction (MMCI) module, facilitates more accurate localization of anomalies of different sizes and shapes. Experimental results on datasets like MVTec and VisA demonstrate that FiLo significantly improves the performance of ZSAD in both detection and localization, achieving state-of-the-art performance with an image-level AUC of 83.9% and a pixel-level AUC of 95.9% on the VisA dataset. Code is available at https://github.com/CASIA-IVA-Lab/FiLo.
CVAug 27, 2024
MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image SegmentationYuanbing Zhu, Bingke Zhu, Yingying Chen et al.
Pretrained vision-language models (VLMs), \eg CLIP, are increasingly used to bridge the gap between open- and close-vocabulary recognition in open-vocabulary image segmentation. As VLMs are generally pretrained with low-resolution images (e.g. $224\times224$), most previous methods operate only on downscaled images. We question this design as low resolution features often fail to preserve fine details. A typical solution is to employ additional image backbones for high-resolution inputs, but it also introduce significant computation overhead. Therefore, we propose MROVSeg, a multi-resolution training framework for open-vocabulary image segmentation with a single pretrained CLIP backbone, that uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which restores the spatial geometry and grasps local-global correspondences across patches by interacting with multi-resolution features. To achieve accurate segmentation, we introduce Multi-grained Masked Attention scheme to aggregate multi-grained semantics from multi-resolution CLIP features to object queries. Through comprehensive experiments, we demonstrate the superiority of MROVSeg on well-established open-vocabulary image segmentation benchmarks, establishing new standards for open-vocabulary image segmentation.
CVDec 4, 2024Code
UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly DetectionZhaopeng Gu, Bingke Zhu, Guibo Zhu et al.
Visual Anomaly Detection (VAD) aims to identify abnormal samples in images that deviate from normal patterns, covering multiple domains, including industrial, logical, and medical fields. Due to the domain gaps between these fields, existing VAD methods are typically tailored to each domain, with specialized detection techniques and model architectures that are difficult to generalize across different domains. Moreover, even within the same domain, current VAD approaches often follow a "one-category-one-model" paradigm, requiring large amounts of normal samples to train class-specific models, resulting in poor generalizability and hindering unified evaluation across domains. To address this issue, we propose a generalized few-shot VAD method, UniVAD, capable of detecting anomalies across various domains, such as industrial, logical, and medical anomalies, with a training-free unified model. UniVAD only needs few normal samples as references during testing to detect anomalies in previously unseen objects, without training on the specific domain. Specifically, UniVAD employs a Contextual Component Clustering ($C^3$) module based on clustering and vision foundation models to segment components within the image accurately, and leverages Component-Aware Patch Matching (CAPM) and Graph-Enhanced Component Modeling (GECM) modules to detect anomalies at different semantic levels, which are aggregated to produce the final detection result. We conduct experiments on nine datasets spanning industrial, logical, and medical fields, and the results demonstrate that UniVAD achieves state-of-the-art performance in few-shot anomaly detection tasks across multiple domains, outperforming domain-specific anomaly detection models. Code is available at https://github.com/FantasticGNU/UniVAD.
CVJan 17, 2025Code
FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable LocalizationZhaopeng Gu, Bingke Zhu, Guibo Zhu et al.
Anomaly detection methods typically require extensive normal samples from the target class for training, limiting their applicability in scenarios that require rapid adaptation, such as cold start. Zero-shot and few-shot anomaly detection do not require labeled samples from the target class in advance, making them a promising research direction. Existing zero-shot and few-shot approaches often leverage powerful multimodal models to detect and localize anomalies by comparing image-text similarity. However, their handcrafted generic descriptions fail to capture the diverse range of anomalies that may emerge in different objects, and simple patch-level image-text matching often struggles to localize anomalous regions of varying shapes and sizes. To address these issues, this paper proposes the FiLo++ method, which consists of two key components. The first component, Fused Fine-Grained Descriptions (FusDes), utilizes large language models to generate anomaly descriptions for each object category, combines both fixed and learnable prompt templates and applies a runtime prompt filtering method, producing more accurate and task-specific textual descriptions. The second component, Deformable Localization (DefLoc), integrates the vision foundation model Grounding DINO with position-enhanced text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI) module, enabling accurate localization of anomalies with various shapes and sizes. In addition, we design a position-enhanced patch matching approach to improve few-shot anomaly detection performance. Experiments on multiple datasets demonstrate that FiLo++ achieves significant performance improvements compared with existing methods. Code will be available at https://github.com/CASIA-IVA-Lab/FiLo.
54.1SDApr 9Code
Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation LearningLinge Wang, Yingying Chen, Bingke Zhu et al.
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2\% to 37.4\% for video-to-audio retrieval and from 27.9\% to 37.1\% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.
CVJul 2, 2025Code
MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video ParsingLangyu Wang, Bingke Zhu, Yingying Chen et al.
The weakly-supervised audio-visual video parsing (AVVP) aims to predict all modality-specific events and locate their temporal boundaries. Despite significant progress, due to the limitations of the weakly-supervised and the deficiencies of the model architecture, existing methods are lacking in simultaneously improving both the segment-level prediction and the event-level prediction. In this work, we propose a audio-visual Mamba network with pseudo labeling aUGmentation (MUG) for emphasising the uniqueness of each segment and excluding the noise interference from the alternate modalities. Specifically, we annotate some of the pseudo-labels based on previous work. Using unimodal pseudo-labels, we perform cross-modal random combinations to generate new data, which can enhance the model's ability to parse various segment-level event combinations. For feature processing and interaction, we employ a audio-visual mamba network. The AV-Mamba enhances the ability to perceive different segments and excludes additional modal noise while sharing similar modal information. Our extensive experiments demonstrate that MUG improves state-of-the-art results on LLP dataset in all metrics (e.g,, gains of 2.1% and 1.2% in terms of visual Segment-level and audio Segment-level metrics). Our code is available at https://github.com/WangLY136/MUG.
CRFeb 5, 2024
DisDet: Exploring Detectability of Backdoor Attack on Diffusion ModelsYang Sui, Huy Phan, Jinqi Xiao et al.
In the exciting generative AI era, the diffusion model has emerged as a very powerful and widely adopted content generation and editing tool for various data modalities, making the study of their potential security risks very necessary and critical. Very recently, some pioneering works have shown the vulnerability of the diffusion model against backdoor attacks, calling for in-depth analysis and investigation of the security challenges of this popular and fundamental AI technique. In this paper, for the first time, we systematically explore the detectability of the poisoned noise input for the backdoored diffusion models, an important performance metric yet little explored in the existing works. Starting from the perspective of a defender, we first analyze the properties of the trigger pattern in the existing diffusion backdoor attacks, discovering the important role of distribution discrepancy in Trojan detection. Based on this finding, we propose a low-cost trigger detection mechanism that can effectively identify the poisoned input noise. We then take a further step to study the same problem from the attack side, proposing a backdoor attack strategy that can learn the unnoticeable trigger to evade our proposed detection scheme. Empirical evaluations across various diffusion models and datasets demonstrate the effectiveness of the proposed trigger detection and detection-evading attack strategy. For trigger detection, our distribution discrepancy-based solution can achieve a 100\% detection rate for the Trojan triggers used in the existing works. For evading trigger detection, our proposed stealthy trigger design approach performs end-to-end learning to make the distribution of poisoned noise input approach that of benign noise, enabling nearly 100\% detection pass rate with very high attack and benign performance for the backdoored diffusion models.
CVDec 30, 2024
LINK: Adaptive Modality Interaction for Audio-Visual Video ParsingLangyu Wang, Bingke Zhu, Yingying Chen et al.
Audio-visual video parsing focuses on classifying videos through weak labels while identifying events as either visible, audible, or both, alongside their respective temporal boundaries. Many methods ignore that different modalities often lack alignment, thereby introducing extra noise during modal interaction. In this work, we introduce a Learning Interaction method for Non-aligned Knowledge (LINK), designed to equilibrate the contributions of distinct modalities by dynamically adjusting their input during event prediction. Additionally, we leverage the semantic information of pseudo-labels as a priori knowledge to mitigate noise from other modalities. Our experimental findings demonstrate that our model outperforms existing methods on the LLP dataset.
LGNov 30, 2024
Friend or Foe? Harnessing Controllable Overfitting for Anomaly DetectionLong Qian, Bingke Zhu, Yingying Chen et al.
Overfitting has traditionally been viewed as detrimental to anomaly detection, where excessive generalization often limits models' sensitivity to subtle anomalies. Our work challenges this conventional view by introducing Controllable Overfitting-based Anomaly Detection (COAD), a novel framework that strategically leverages overfitting to enhance anomaly discrimination capabilities. We propose the Aberrance Retention Quotient (ARQ), a novel metric that systematically quantifies the extent of overfitting, enabling the identification of an optimal golden overfitting interval wherein model sensitivity to anomalies is maximized without sacrificing generalization. To comprehensively capture how overfitting affects detection performance, we further propose the Relative Anomaly Distribution Index (RADI), a metric superior to traditional AUROC by explicitly modeling the separation between normal and anomalous score distributions. Theoretically, RADI leverages ARQ to track and evaluate how overfitting impacts anomaly detection, offering an integrated approach to understanding the relationship between overfitting dynamics and model efficacy. We also rigorously validate the statistical efficacy of Gaussian noise as pseudo-anomaly generators, reinforcing the method's broad applicability. Empirical evaluations demonstrate that our controllable overfitting method achieves State-Of-The-Art(SOTA) performance in both one-class and multi-class anomaly detection tasks, thus redefining overfitting as a powerful strategy rather than a limitation.
CVApr 17, 2025
MathPhys-Guided Coarse-to-Fine Anomaly Synthesis with SQE-Driven Bi-Level Optimization for Anomaly DetectionLong Qian, Bingke Zhu, Yingying Chen et al.
Currently, industrial anomaly detection suffers from two bottlenecks: (i) the rarity of real-world defect images and (ii) the opacity of sample quality when synthetic data are used. Existing synthetic strategies (e.g., cut-and-paste) overlook the underlying physical causes of defects, leading to inconsistent, low-fidelity anomalies that hamper model generalization to real-world complexities. In this paper, we introduce a novel and lightweight pipeline that generates synthetic anomalies through Math-Phys model guidance, refines them via a Coarse-to-Fine approach and employs a bi-level optimization strategy with a Synthesis Quality Estimator (SQE). By combining physical modeling of the three most typical physics-driven defect mechanisms: Fracture Line (FL), Pitting Loss (PL), and Plastic Warpage (PW), our method produces realistic defect masks, which are subsequently enhanced in two phases. The first stage (npcF) enforces a PDE-based consistency to achieve a globally coherent anomaly structure, while the second stage (npcF++) further improves local fidelity. Additionally, we leverage SQE-driven weighting, ensuring that high-quality synthetic samples receive greater emphasis during training. To validate our method, we conduct experiments on three anomaly detection benchmarks: MVTec AD, VisA, and BTAD. Across these datasets, our method achieves state-of-the-art results in both image- and pixel-AUROC, confirming the effectiveness of our MaPhC2F dataset and BiSQAD method. All code will be released.
CVApr 16, 2024
Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language ModelsEnming Zhang, Bingke Zhu, Yingying Chen et al.
Vision-Language Models (VLMs), such as CLIP, play a foundational role in various cross-modal applications. To fully leverage VLMs' potential in adapting to downstream tasks, context optimization methods like Prompt Tuning are essential. However, one key limitation is the lack of diversity in prompt templates, whether they are hand-crafted or learned through additional modules. This limitation restricts the capabilities of pretrained VLMs and can result in incorrect predictions in downstream tasks. To address this challenge, we propose Context Optimization with Multi-Knowledge Representation (CoKnow), a framework that enhances Prompt Learning for VLMs with rich contextual knowledge. To facilitate CoKnow during inference, we trained lightweight semantic knowledge mappers, which are capable of generating Multi-Knowledge Representation for an input image without requiring additional priors. Experimentally, We conducted extensive experiments on 11 publicly available datasets, demonstrating that CoKnow outperforms a series of previous methods.
CVAug 8, 2025
AnomalyMoE: Towards a Language-free Generalist Model for Unified Visual Anomaly DetectionZhaopeng Gu, Bingke Zhu, Guibo Zhu et al.
Anomaly detection is a critical task across numerous domains and modalities, yet existing methods are often highly specialized, limiting their generalizability. These specialized models, tailored for specific anomaly types like textural defects or logical errors, typically exhibit limited performance when deployed outside their designated contexts. To overcome this limitation, we propose AnomalyMoE, a novel and universal anomaly detection framework based on a Mixture-of-Experts (MoE) architecture. Our key insight is to decompose the complex anomaly detection problem into three distinct semantic hierarchies: local structural anomalies, component-level semantic anomalies, and global logical anomalies. AnomalyMoE correspondingly employs three dedicated expert networks at the patch, component, and global levels, and is specialized in reconstructing features and identifying deviations at its designated semantic level. This hierarchical design allows a single model to concurrently understand and detect a wide spectrum of anomalies. Furthermore, we introduce an Expert Information Repulsion (EIR) module to promote expert diversity and an Expert Selection Balancing (ESB) module to ensure the comprehensive utilization of all experts. Experiments on 8 challenging datasets spanning industrial imaging, 3D point clouds, medical imaging, video surveillance, and logical anomaly detection demonstrate that AnomalyMoE establishes new state-of-the-art performance, significantly outperforming specialized methods in their respective domains.
CVAug 5, 2025
Quality-Aware Language-Conditioned Local Auto-Regressive Anomaly Synthesis and DetectionLong Qian, Bingke Zhu, Yingying Chen et al.
Despite substantial progress in anomaly synthesis methods, existing diffusion-based and coarse inpainting pipelines commonly suffer from structural deficiencies such as micro-structural discontinuities, limited semantic controllability, and inefficient generation. To overcome these limitations, we introduce ARAS, a language-conditioned, auto-regressive anomaly synthesis approach that precisely injects local, text-specified defects into normal images via token-anchored latent editing. Leveraging a hard-gated auto-regressive operator and a training-free, context-preserving masked sampling kernel, ARAS significantly enhances defect realism, preserves fine-grained material textures, and provides continuous semantic control over synthesized anomalies. Integrated within our Quality-Aware Re-weighted Anomaly Detection (QARAD) framework, we further propose a dynamic weighting strategy that emphasizes high-quality synthetic samples by computing an image-text similarity score with a dual-encoder model. Extensive experiments across three benchmark datasets-MVTec AD, VisA, and BTAD, demonstrate that our QARAD outperforms SOTA methods in both image- and pixel-level anomaly detection tasks, achieving improved accuracy, robustness, and a 5 times synthesis speedup compared to diffusion-based alternatives. Our complete code and synthesized dataset will be publicly available.
SRFeb 25, 2025
FLARE: A Framework for Stellar Flare Forecasting using Stellar Physical Properties and Historical RecordsBingke Zhu, Xiaoxiao Wang, Minghui Jia et al.
Stellar flare events are critical observational samples for astronomical research; however, recorded flare events remain limited. Stellar flare forecasting can provide additional flare event samples to support research efforts. Despite this potential, no specialized models for stellar flare forecasting have been proposed to date. In this paper, we present extensive experimental evidence demonstrating that both stellar physical properties and historical flare records are valuable inputs for flare forecasting tasks. We then introduce FLARE (Forecasting Light-curve-based Astronomical Records via features Ensemble), the first-of-its-kind large model specifically designed for stellar flare forecasting. FLARE integrates stellar physical properties and historical flare records through a novel Soft Prompt Module and Residual Record Fusion Module. Our experiments on the publicly available Kepler light curve dataset demonstrate that FLARE achieves superior performance compared to other methods across all evaluation metrics. Finally, we validate the forecast capability of our model through a comprehensive case study.
ASApr 26, 2020
Enabling Fast and Universal Audio Adversarial Attack Using Generative ModelYi Xie, Zhuohang Li, Cong Shi et al.
Recently, the vulnerability of DNN-based audio systems to adversarial attacks has obtained the increasing attention. However, the existing audio adversarial attacks allow the adversary to possess the entire user's audio input as well as granting sufficient time budget to generate the adversarial perturbations. These idealized assumptions, however, makes the existing audio adversarial attacks mostly impossible to be launched in a timely fashion in practice (e.g., playing unnoticeable adversarial perturbations along with user's streaming input). To overcome these limitations, in this paper we propose fast audio adversarial perturbation generator (FAPG), which uses generative model to generate adversarial perturbations for the audio input in a single forward pass, thereby drastically improving the perturbation generation speed. Built on the top of FAPG, we further propose universal audio adversarial perturbation generator (UAPG), a scheme crafting universal adversarial perturbation that can be imposed on arbitrary benign audio input to cause misclassification. Extensive experiments show that our proposed FAPG can achieve up to 167X speedup over the state-of-the-art audio adversarial attack methods. Also our proposed UAPG can generate universal adversarial perturbation that achieves much better attack performance than the state-of-the-art solutions.
HCMar 20, 2020
WiEat: Fine-grained Device-free Eating Monitoring Leveraging Wi-Fi SignalsChen Wang, Zhenzhe Lin, Yucheng Xie et al.
Eating is a fundamental activity in people's daily life. Studies have shown that many health-related problems such as obesity, diabetes and anemia are closely associated with people's unhealthy eating habits (e.g., skipping meals, eating irregularly and overeating). Traditional eating monitoring solutions relying on self-reports remain an onerous task, while the recent trend requiring users to wear expensive dedicated hardware is still invasive. To overcome these limitations, in this paper, we develop a device-free eating monitoring system using WiFi-enabled devices (e.g., smartphone or laptop). Our system aims to automatically monitor users' eating activities through identifying the fine-grained eating motions and detecting the chewing and swallowing. In particular, our system extracts the fine-grained Channel State Information (CSI) from WiFi signals to distinguish eating from non-eating activities and further recognizing users' detailed eating motions with different utensils (e.g., using a folk, knife, spoon or bare hands). Moreover, the system has the capability of identifying chewing and swallowing through detecting users' minute facial muscle movements based on the derived CSI spectrogram. Such fine-grained eating monitoring results are beneficial to the understanding of the user's eating behaviors and can be used to estimate food intake types and amounts. Extensive experiments with 20 users over 1600-minute eating show that the proposed system can recognize the user's eating motions with up to 95% accuracy and estimate the chewing and swallowing amount with 10% percentage error.
HCMar 20, 2020
WearID: Wearable-Assisted Low-Effort Authentication to Voice Assistants using Cross-Domain Speech SimilarityChen Wang, Cong Shi, Yingying Chen et al.
Due to the open nature of voice input, voice assistant (VA) systems (e.g., Google Home and Amazon Alexa) are under a high risk of sensitive information leakage (e.g., personal schedules and shopping accounts). Though the existing VA systems may employ voice features to identify users, they are still vulnerable to various acoustic attacks (e.g., impersonation, replay and hidden command attacks). In this work, we focus on the security issues of the emerging VA systems and aim to protect the users' highly sensitive information from these attacks. Towards this end, we propose a system, WearID, which uses an off-the-shelf wearable device (e.g., a smartwatch or bracelet) as a secure token to verify the user's voice commands to the VA system. In particular, WearID exploits the readily available motion sensors from most wearables to describe the command sound in vibration domain and check the received command sound across two domains (i.e., wearable's motion sensor vs. VA device's microphone) to ensure the sound is from the legitimate user.
HCMar 20, 2020
EchoLock: Towards Low Effort Mobile User IdentificationYilin Yang, Chen Wang, Yingying Chen et al.
User identification plays a pivotal role in how we interact with our mobile devices. Many existing authentication approaches require active input from the user or specialized sensing hardware, and studies on mobile device usage show significant interest in less inconvenient procedures. In this paper, we propose EchoLock, a low effort identification scheme that validates the user by sensing hand geometry via commodity microphones and speakers. These acoustic signals produce distinct structure-borne sound reflections when contacting the user's hand, which can be used to differentiate between different people based on how they hold their mobile devices. We process these reflections to derive unique acoustic features in both the time and frequency domain, which can effectively represent physiological and behavioral traits, such as hand contours, finger sizes, holding strength, and gesture. Furthermore, learning-based algorithms are developed to robustly identify the user under various environments and conditions. We conduct extensive experiments with 20 participants using different hardware setups in key use case scenarios and study various attack models to demonstrate the performance of our proposed system. Our results show that EchoLock is capable of verifying users with over 90% accuracy, without requiring any active input from the user.
ASMar 4, 2020
Real-time, Universal, and Robust Adversarial Attacks Against Speaker Recognition SystemsYi Xie, Cong Shi, Zhuohang Li et al.
As the popularity of voice user interface (VUI) exploded in recent years, speaker recognition system has emerged as an important medium of identifying a speaker in many security-required applications and services. In this paper, we propose the first real-time, universal, and robust adversarial attack against the state-of-the-art deep neural network (DNN) based speaker recognition system. Through adding an audio-agnostic universal perturbation on arbitrary enrolled speaker's voice input, the DNN-based speaker recognition system would identify the speaker as any target (i.e., adversary-desired) speaker label. In addition, we improve the robustness of our attack by modeling the sound distortions caused by the physical over-the-air propagation through estimating room impulse response (RIR). Experiment using a public dataset of 109 English speakers demonstrates the effectiveness and robustness of our proposed attack with a high attack success rate of over 90%. The attack launching time also achieves a 100X speedup over contemporary non-universal attacks.
CRJul 12, 2019
Motion Sensor-based Privacy Attack on SmartphonesS Abhishek Anand, Chen Wang, Jian Liu et al.
In this paper, we build a speech privacy attack that exploits speech reverberations generated from a smartphone's in-built loudspeaker captured via a zero-permission motion sensor (accelerometer). We design our attack Spearphone2, and demonstrate that speech reverberations from inbuilt loudspeakers, at an appropriate loudness, can impact the accelerometer, leaking sensitive information about the speech. In particular, we show that by exploiting the affected accelerometer readings and carefully selecting feature sets along with off-the-shelf machine learning techniques, Spearphone can successfully perform gender classification (accuracy over 90%) and speaker identification (accuracy over 80%) for any audio/video playback on the smartphone. Our results with testing the attack on a voice call and voice assistant response were also encouraging, showcasing the impact of the proposed attack. In addition, we perform speech recognition and speech reconstruction to extract more information about the eavesdropped speech to an extent. Our work brings to light a fundamental design vulnerability in many currently-deployed smartphones, which may put people's speech privacy at risk while using the smartphone in the loudspeaker mode during phone calls, media playback or voice assistant interactions.
CVJun 25, 2019
Learning Feature Embeddings for Discriminant Model based TrackingLinyu Zheng, Ming Tang, Yingying Chen et al.
After observing that the features used in most online discriminatively trained trackers are not optimal, in this paper, we propose a novel and effective architecture to learn optimal feature embeddings for online discriminative tracking. Our method, called DCFST, integrates the solver of a discriminant model that is differentiable and has a closed-form solution into convolutional neural networks. Then, the resulting network can be trained in an end-to-end way, obtaining optimal feature embeddings for the discriminant model-based tracker. As an instance, we apply the popular ridge regression model in this work to demonstrate the power of DCFST. Extensive experiments on six public benchmarks, OTB2015, NFS, GOT10k, TrackingNet, VOT2018, and VOT2019, show that our approach is efficient and generalizes well to class-agnostic target objects in online tracking, thus achieves state-of-the-art accuracy, while running beyond the real-time speed. Code will be made available.
CVDec 18, 2018
Recurrent Calibration Network for Irregular Text RecognitionYunze Gao, Yingying Chen, Jinqiao Wang et al.
Scene text recognition has received increased attention in the research community. Text in the wild often possesses irregular arrangements, typically including perspective text, curved text, oriented text. Most existing methods are hard to work well for irregular text, especially for severely distorted text. In this paper, we propose a novel Recurrent Calibration Network (RCN) for irregular scene text recognition. The RCN progressively calibrates the irregular text to boost the recognition performance. By decomposing the calibration process into multiple steps, the irregular text can be calibrated to normal one step by step. Besides, in order to avoid the accumulation of lost information caused by inaccurate transformation, we further design a fiducial-point refinement structure to keep the integrity of text during the recurrent process. Instead of the calibrated images, the coordinates of fiducial points are tracked and refined, which implicitly models the transformation information. Based on the refined fiducial points, we estimate the transformation parameters and sample from the original image at each step. In this way, the original character information is preserved until the final transformation. Such designs lead to optimal calibration results to boost the performance of succeeding recognition. Extensive experiments on challenging datasets demonstrate the superiority of our method, especially on irregular benchmarks.
CVMay 12, 2018
BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask LearningFisher Yu, Haofeng Chen, Xin Wang et al.
Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study multitask learning for autonomous driving. Researchers are usually constrained to study a small set of problems on one dataset, while real-world computer vision applications require performing tasks of various complexities. We construct BDD100K, the largest driving video dataset with 100K videos and 10 tasks to evaluate the exciting progress of image recognition algorithms on autonomous driving. The dataset possesses geographic, environmental, and weather diversity, which is useful for training models that are less likely to be surprised by new conditions. Based on this diverse dataset, we build a benchmark for heterogeneous multitask learning and study how to solve the tasks together. Our experiments show that special training strategies are needed for existing models to perform such heterogeneous tasks. BDD100K opens the door for future studies in this important venue.
CVSep 13, 2017
Reading Scene Text with Attention Convolutional Sequence ModelingYunze Gao, Yingying Chen, Jinqiao Wang et al.
Reading text in the wild is a challenging task in the field of computer vision. Existing approaches mainly adopted Connectionist Temporal Classification (CTC) or Attention models based on Recurrent Neural Network (RNN), which is computationally expensive and hard to train. In this paper, we present an end-to-end Attention Convolutional Network for scene text recognition. Firstly, instead of RNN, we adopt the stacked convolutional layers to effectively capture the contextual dependencies of the input sequence, which is characterized by lower computational complexity and easier parallel computation. Compared to the chain structure of recurrent networks, the Convolutional Neural Network (CNN) provides a natural way to capture long-term dependencies between elements, which is 9 times faster than Bidirectional Long Short-Term Memory (BLSTM). Furthermore, in order to enhance the representation of foreground text and suppress the background noise, we incorporate the residual attention modules into a small densely connected network to improve the discriminability of CNN features. We validate the performance of our approach on the standard benchmarks, including the Street View Text, IIIT5K and ICDAR datasets. As a result, state-of-the-art or highly-competitive performance and efficiency show the superiority of the proposed approach.
CVJul 26, 2017
Fast Deep Matting for Portrait Animation on Mobile PhoneBingke Zhu, Yingying Chen, Jinqiao Wang et al.
Image matting plays an important role in image and video editing. However, the formulation of image matting is inherently ill-posed. Traditional methods usually employ interaction to deal with the image matting problem with trimaps and strokes, and cannot run on the mobile phone in real-time. In this paper, we propose a real-time automatic deep matting approach for mobile devices. By leveraging the densely connected blocks and the dilated convolution, a light full convolutional network is designed to predict a coarse binary mask for portrait images. And a feathering block, which is edge-preserving and matting adaptive, is further developed to learn the guided filter and transform the binary mask into alpha matte. Finally, an automatic portrait animation system based on fast deep matting is built on mobile devices, which does not need any interaction and can realize real-time matting with 15 fps. The experiments show that the proposed approach achieves comparable results with the state-of-the-art matting solvers.
CVJul 24, 2017
Joint Background Reconstruction and Foreground Segmentation via A Two-stage Convolutional Neural NetworkXu Zhao, Yingying Chen, Ming Tang et al.
Foreground segmentation in video sequences is a classic topic in computer vision. Due to the lack of semantic and prior knowledge, it is difficult for existing methods to deal with sophisticated scenes well. Therefore, in this paper, we propose an end-to-end two-stage deep convolutional neural network (CNN) framework for foreground segmentation in video sequences. In the first stage, a convolutional encoder-decoder sub-network is employed to reconstruct the background images and encode rich prior knowledge of background scenes. In the second stage, the reconstructed background and current frame are input into a multi-channel fully-convolutional sub-network (MCFCN) for accurate foreground segmentation. In the two-stage CNN, the reconstruction loss and segmentation loss are jointly optimized. The background images and foreground objects are output simultaneously in an end-to-end way. Moreover, by incorporating the prior semantic knowledge of foreground and background in the pre-training process, our method could restrain the background noise and keep the integrity of foreground objects at the same time. Experiments on CDNet 2014 show that our method outperforms the state-of-the-art by 4.9%.