Qin Zou

CV
h-index18
36papers
1,180citations
Novelty47%
AI Score59

36 Papers

CVJun 1Code
Divide and Conquer: Reliable Multi-View Evidential Learning for Deepfake Detection

Xiaolu Kang, Zhongyuan Wang, Jikang Cheng et al.

With the evolution of generative models, deepfakes have achieved near-perfect semantic realism, leaving forensic traces only in subtle structural anomalies. However, existing single-view paradigms often fail to generalize, as dominant semantic features overwhelm subtle artifact cues within entangled representations. This imbalance leads to overconfident yet brittle predictions -- a phenomenon we term the Semantic Masking Effect. To address this challenge, we propose a reliable framework called Divide-and-Conquer Multi-View Evidential Learning (DiCoME) for Deepfake Detection. In the "Divide" phase, we employ Geometric View Purification to decompose the entangled representation space through principled geometric projection. This process suppresses semantic interference within artifact-sensitive representations, forming the foundation for decorrelated yet complementary semantic and artifact views. In the "Conquer" phase, we leverage Uncertainty-Aware Evidential Learning to synthesize these distinct views. By explicitly modeling the "epistemic conflict" between semantic and artifact cues, this mechanism provides calibrated uncertainty estimates instead of forcing rigid deterministic decisions. Extensive experiments across multiple benchmarks demonstrate that our method consistently outperforms existing approaches in generalization performance, while providing reliable uncertainty estimation for trustworthy deepfake detection. Code is available at https://github.com/kxl0825/DiCoME.git.

CVOct 27, 2022Code
Domain Adaptive Object Detection for Autonomous Driving under Foggy Weather

Jinlong Li, Runsheng Xu, Jin Ma et al.

Most object detection methods for autonomous driving usually assume a consistent feature distribution between training and testing data, which is not always the case when weathers differ significantly. The object detection model trained under clear weather might not be effective enough in foggy weather because of the domain gap. This paper proposes a novel domain adaptive object detection framework for autonomous driving under foggy weather. Our method leverages both image-level and object-level adaptation to diminish the domain discrepancy in image style and object appearance. To further enhance the model's capabilities under challenging samples, we also come up with a new adversarial gradient reversal layer to perform adversarial mining for the hard examples together with domain adaptation. Moreover, we propose to generate an auxiliary domain by data augmentation to enforce a new domain-level metric regularization. Experimental results on public benchmarks show the effectiveness and accuracy of the proposed method. The code is available at https://github.com/jinlong17/DA-Detect.

CVNov 12, 2022Code
Line Drawing Guided Progressive Inpainting of Mural Damage

Luxi Li, Qin Zou, Fan Zhang et al.

Mural image inpainting is far less explored compared to its natural image counterpart and remains largely unsolved. Most existing image-inpainting methods tend to take the target image as the only input and directly repair the damage to generate a visually plausible result. These methods obtain high performance in restoration or completion of some pre-defined objects, e.g., human face, fabric texture, and printed texts, etc., however, are not suitable for repairing murals with varying subjects and large damaged areas. Moreover, due to discrete colors in paints, mural inpainting may suffer from apparent color bias. To this end, in this paper, we propose a line drawing guided progressive mural inpainting method. It divides the inpainting process into two steps: structure reconstruction and color correction, implemented by a structure reconstruction network (SRN) and a color correction network (CCN), respectively. In structure reconstruction, SRN utilizes the line drawing as an assistant to achieve large-scale content authenticity and structural stability. In color correction, CCN operates a local color adjustment for missing pixels which reduces the negative effects of color bias and edge jumping. The proposed approach is evaluated against the current state-of-the-art image inpainting methods. Qualitative and quantitative results demonstrate the superiority of the proposed method in mural image inpainting. The codes and data are available at https://github.com/qinnzou/mural-image-inpainting.

CVMar 14, 2023Code
Adjacent-view Transformers for Supervised Surround-view Depth Estimation

Xianda Guo, Wenjie Yuan, Yunpeng Zhang et al.

Depth estimation has been widely studied and serves as the fundamental step of 3D perception for robotics and autonomous driving. Though significant progress has been made in monocular depth estimation in the past decades, these attempts are mainly conducted on the KITTI benchmark with only front-view cameras, which ignores the correlations across surround-view cameras. In this paper, we propose an Adjacent-View Transformer for Supervised Surround-view Depth estimation (AVT-SSDepth), to jointly predict the depth maps across multiple surrounding cameras. Specifically, we employ a global-to-local feature extraction module that combines CNN with transformer layers for enriched representations. Further, the adjacent-view attention mechanism is proposed to enable the intra-view and inter-view feature propagation. The former is achieved by the self-attention module within each view, while the latter is realized by the adjacent attention module, which computes the attention across multi-cameras to exchange the multi-scale representations across surroundview feature maps. In addition, AVT-SSDepth has strong crossdataset generalization. Extensive experiments show that our method achieves superior performance over existing state-ofthe-art methods on both DDAD and nuScenes datasets. Code is available at https://github.com/XiandaGuo/SSDepth.

CVJul 16, 2023
S2R-ViT for Multi-Agent Cooperative Perception: Bridging the Gap from Simulation to Reality

Jinlong Li, Runsheng Xu, Xinyu Liu et al.

Due to the lack of enough real multi-agent data and time-consuming of labeling, existing multi-agent cooperative perception algorithms usually select the simulated sensor data for training and validating. However, the perception performance is degraded when these simulation-trained models are deployed to the real world, due to the significant domain gap between the simulated and real data. In this paper, we propose the first Simulation-to-Reality transfer learning framework for multi-agent cooperative perception using a novel Vision Transformer, named as S2R-ViT, which considers both the Deployment Gap and Feature Gap between simulated and real data. We investigate the effects of these two types of domain gaps and propose a novel uncertainty-aware vision transformer to effectively relief the Deployment Gap and an agent-based feature adaptation module with inter-agent and ego-agent discriminators to reduce the Feature Gap. Our intensive experiments on the public multi-agent cooperative perception datasets OPV2V and V2V4Real demonstrate that the proposed S2R-ViT can effectively bridge the gap from simulation to reality and outperform other methods significantly for point cloud-based 3D object detection.

CVJul 18, 2023
Domain Adaptation based Object Detection for Autonomous Driving in Foggy and Rainy Weather

Jinlong Li, Runsheng Xu, Xinyu Liu et al.

Typically, object detection methods for autonomous driving that rely on supervised learning make the assumption of a consistent feature distribution between the training and testing data, this such assumption may fail in different weather conditions. Due to the domain gap, a detection model trained under clear weather may not perform well in foggy and rainy conditions. Overcoming detection bottlenecks in foggy and rainy weather is a real challenge for autonomous vehicles deployed in the wild. To bridge the domain gap and improve the performance of object detection in foggy and rainy weather, this paper presents a novel framework for domain-adaptive object detection. The adaptations at both the image-level and object-level are intended to minimize the differences in image style and object appearance between domains. Furthermore, in order to improve the model's performance on challenging examples, we introduce a novel adversarial gradient reversal layer that conducts adversarial mining on difficult instances in addition to domain adaptation. Additionally, we suggest generating an auxiliary domain through data augmentation to enforce a new domain-level metric regularization. Experimental findings on public benchmark exhibit a substantial enhancement in object detection specifically for foggy and rainy driving scenarios.

CVNov 20, 2022
Coarse-to-fine Task-driven Inpainting for Geoscience Images

Huiming Sun, Jin Ma, Qing Guo et al.

The processing and recognition of geoscience images have wide applications. Most of existing researches focus on understanding the high-quality geoscience images by assuming that all the images are clear. However, in many real-world cases, the geoscience images might contain occlusions during the image acquisition. This problem actually implies the image inpainting problem in computer vision and multimedia. To the best of our knowledge, all the existing image inpainting algorithms learn to repair the occluded regions for a better visualization quality, they are excellent for natural images but not good enough for geoscience images by ignoring the geoscience related tasks. This paper aims to repair the occluded regions for a better geoscience task performance with the advanced visualization quality simultaneously, without changing the current deployed deep learning based geoscience models. Because of the complex context of geoscience images, we propose a coarse-to-fine encoder-decoder network with coarse-to-fine adversarial context discriminators to reconstruct the occluded image regions. Due to the limited data of geoscience images, we use a MaskMix based data augmentation method to exploit more information from limited geoscience image data. The experimental results on three public geoscience datasets for remote sensing scene recognition, cross-view geolocation and semantic segmentation tasks respectively show the effectiveness and accuracy of the proposed method.

CVAug 15, 2024Code
Co-Fix3D: Enhancing 3D Object Detection with Collaborative Refinement

Wenxuan Li, Qin Zou, Chi Chen et al.

3D object detection in driving scenarios faces the challenge of complex road environments, which can lead to the loss or incompleteness of key features, thereby affecting perception performance. To address this issue, we propose an advanced detection framework called Co-Fix3D. Co-Fix3D integrates Local and Global Enhancement (LGE) modules to refine Bird's Eye View (BEV) features. The LGE module uses Discrete Wavelet Transform (DWT) for pixel-level local optimization and incorporates an attention mechanism for global optimization. To handle varying detection difficulties, we adopt multi-head LGE modules, enabling each module to focus on targets with different levels of detection complexity, thus further enhancing overall perception capability. Experimental results show that on the nuScenes dataset's LiDAR benchmark, Co-Fix3D achieves 69.4\% mAP and 73.5\% NDS, while on the multimodal benchmark, it achieves 72.3\% mAP and 74.7\% NDS. The source code is publicly available at \href{https://github.com/rubbish001/Co-Fix3d}{https://github.com/rubbish001/Co-Fix3d}.

CVAug 13, 2024
ED$^4$: Explicit Data-level Debiasing for Deepfake Detection

Jikang Cheng, Ying Zhang, Qin Zou et al.

Learning intrinsic bias from limited data has been considered the main reason for the failure of deepfake detection with generalizability. Apart from the discovered content and specific-forgery bias, we reveal a novel spatial bias, where detectors inertly anticipate observing structural forgery clues appearing at the image center, also can lead to the poor generalization of existing methods. We present ED$^4$, a simple and effective strategy, to address aforementioned biases explicitly at the data level in a unified framework rather than implicit disentanglement via network design. In particular, we develop ClockMix to produce facial structure preserved mixtures with arbitrary samples, which allows the detector to learn from an exponentially extended data distribution with much more diverse identities, backgrounds, local manipulation traces, and the co-occurrence of multiple forgery artifacts. We further propose the Adversarial Spatial Consistency Module (AdvSCM) to prevent extracting features with spatial bias, which adversarially generates spatial-inconsistent images and constrains their extracted feature to be consistent. As a model-agnostic debiasing strategy, ED$^4$ is plug-and-play: it can be integrated with various deepfake detectors to obtain significant benefits. We conduct extensive experiments to demonstrate its effectiveness and superiority over existing deepfake detection approaches.

CVMay 19
Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection

Wenxuan Li, Qin Zou, Shoubing Chen et al.

In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and ego-motion often induce cross-frame spatiotemporal inconsistencies in BEV-based detectors, leading to temporal BEV feature misalignment and degraded spatiotemporal consistency. To address these challenges, we propose Co-Fusion4D, a unified framework that explicitly preserves cross-frame spatiotemporal consistency and suppresses temporal feature drift. Co-Fusion4D adopts a current-frame-centric strategy, treating the current frame as the primary source of information while selectively incorporating historical frames after spatiotemporal filtering and alignment. This dominant-complementary mechanism effectively mitigates cumulative alignment errors, suppresses noisy feature propagation, and exploits reliable temporal cues for a more consistent BEV representation. In addition, Co-Fusion4D integrates a Dual Attention Fusion (DAF) module to further enhance spatiotemporal feature interaction. DAF jointly leverages intra-frame spatial attention and inter-frame temporal attention to adaptively align and fuse multi-frame features, emphasizing motion-consistent regions while suppressing spurious correlations. By departing from conventional uniform fusion paradigms, this design substantially improves the temporal stability and discriminative capability of BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that Co-Fusion4D achieves state-of-the-art performance, with 74.9% mAP and 75.6% NDS, without relying on test-time augmentation or external data.

CVAug 13, 2024
IDRetracor: Towards Visual Forensics Against Malicious Face Swapping

Jikang Cheng, Jiaxin Ai, Zhen Han et al.

The face swapping technique based on deepfake methods poses significant social risks to personal identity security. While numerous deepfake detection methods have been proposed as countermeasures against malicious face swapping, they can only output binary labels (Fake/Real) for distinguishing fake content without reliable and traceable evidence. To achieve visual forensics and target face attribution, we propose a novel task named face retracing, which considers retracing the original target face from the given fake one via inverse mapping. Toward this goal, we propose an IDRetracor that can retrace arbitrary original target identities from fake faces generated by multiple face swapping methods. Specifically, we first adopt a mapping resolver to perceive the possible solution space of the original target face for the inverse mappings. Then, we propose mapping-aware convolutions to retrace the original target face from the fake one. Such convolutions contain multiple kernels that can be combined under the control of the mapping resolver to tackle different face swapping mappings dynamically. Extensive experiments demonstrate that the IDRetracor exhibits promising retracing performance from both quantitative and qualitative perspectives.

CVSep 16, 2025Code
StereoCarla: A High-Fidelity Driving Dataset for Generalizable Stereo

Xianda Guo, Chenming Zhang, Ruilin Wang et al.

Stereo matching plays a crucial role in enabling depth perception for autonomous driving and robotics. While recent years have witnessed remarkable progress in stereo matching algorithms, largely driven by learning-based methods and synthetic datasets, the generalization performance of these models remains constrained by the limited diversity of existing training data. To address these challenges, we present StereoCarla, a high-fidelity synthetic stereo dataset specifically designed for autonomous driving scenarios. Built on the CARLA simulator, StereoCarla incorporates a wide range of camera configurations, including diverse baselines, viewpoints, and sensor placements as well as varied environmental conditions such as lighting changes, weather effects, and road geometries. We conduct comprehensive cross-domain experiments across four standard evaluation datasets (KITTI2012, KITTI2015, Middlebury, ETH3D) and demonstrate that models trained on StereoCarla outperform those trained on 11 existing stereo datasets in terms of generalization accuracy across multiple benchmarks. Furthermore, when integrated into multi-dataset training, StereoCarla contributes substantial improvements to generalization accuracy, highlighting its compatibility and scalability. This dataset provides a valuable benchmark for developing and evaluating stereo algorithms under realistic, diverse, and controllable settings, facilitating more robust depth perception systems for autonomous vehicles. Code can be available at https://github.com/XiandaGuo/OpenStereo, and data can be available at https://xiandaguo.net/StereoCarla.

CVNov 21, 2024Code
Stereo Anything: Unifying Zero-shot Stereo Matching with Large-Scale Mixed Data

Xianda Guo, Chenming Zhang, Youmin Zhang et al.

Stereo matching serves as a cornerstone in 3D vision, aiming to establish pixel-wise correspondences between stereo image pairs for depth recovery. Despite remarkable progress driven by deep neural architectures, current models often exhibit severe performance degradation when deployed in unseen domains, primarily due to the limited diversity of training data. In this work, we introduce StereoAnything, a data-centric framework that substantially enhances the zero-shot generalization capability of existing stereo models. Rather than devising yet another specialized architecture, we scale stereo training to an unprecedented level by systematically unifying heterogeneous stereo sources: (1) curated labeled datasets covering diverse environments, and (2) large-scale synthetic stereo pairs generated from unlabeled monocular images. Our mixed-data strategy delivers consistent and robust learning signals across domains, effectively mitigating dataset bias. Extensive zero-shot evaluations on four public benchmarks demonstrate that Stereo Anything achieves state-of-the-art generalization. This work paves the way towards truly universal stereo matching, offering a scalable data paradigm applicable to any stereo image pair. We extensively evaluate the zero-shot capabilities of our model on four public datasets, showcasing its impressive ability to generalize to any stereo image pair. Code is available at https://github.com/XiandaGuo/OpenStereo.

CVSep 14, 2021Code
Luminance Attentive Networks for HDR Image and Panorama Reconstruction

Hanning Yu, Wentao Liu, Chengjiang Long et al.

It is very challenging to reconstruct a high dynamic range (HDR) from a low dynamic range (LDR) image as an ill-posed problem. This paper proposes a luminance attentive network named LANet for HDR reconstruction from a single LDR image. Our method is based on two fundamental observations: (1) HDR images stored in relative luminance are scale-invariant, which means the HDR images will hold the same information when multiplied by any positive real number. Based on this observation, we propose a novel normalization method called " HDR calibration " for HDR images stored in relative luminance, calibrating HDR images into a similar luminance scale according to the LDR images. (2) The main difference between HDR images and LDR images is in under-/over-exposed areas, especially those highlighted. Following this observation, we propose a luminance attention module with a two-stream structure for LANet to pay more attention to the under-/over-exposed areas. In addition, we propose an extended network called panoLANet for HDR panorama reconstruction from an LDR panorama and build a dualnet structure for panoLANet to solve the distortion problem caused by the equirectangular panorama. Extensive experiments show that our proposed approach LANet can reconstruct visually convincing HDR images and demonstrate its superiority over state-of-the-art approaches in terms of all metrics in inverse tone mapping. The image-based lighting application with our proposed panoLANet also demonstrates that our method can simulate natural scene lighting using only LDR panorama. Our source code is available at https://github.com/LWT3437/LANet.

CVDec 11, 2024
CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs

Yihan Cao, Jiazhao Zhang, Zhinan Yu et al.

Object goal navigation (ObjectNav) is a fundamental task in embodied AI, requiring an agent to locate a target object in previously unseen environments. This task is particularly challenging because it requires both perceptual and cognitive processes, including object recognition and decision-making. While substantial advancements in perception have been driven by the rapid development of visual foundation models, progress on the cognitive aspect remains constrained, primarily limited to either implicit learning through simulator rollouts or explicit reliance on predefined heuristic rules. Inspired by neuroscientific findings demonstrating that humans maintain and dynamically update fine-grained cognitive states during object search tasks in novel environments, we propose CogNav, a framework designed to mimic this cognitive process using large language models. Specifically, we model the cognitive process using a finite state machine comprising fine-grained cognitive states, ranging from exploration to identification. Transitions between states are determined by a large language model based on a dynamically constructed heterogeneous cognitive map, which contains spatial and semantic information about the scene being explored. Extensive evaluations on the HM3D, MP3D, and RoboTHOR benchmarks demonstrate that our cognitive process modeling significantly improves the success rate of ObjectNav at least by relative 14% over the state-of-the-arts.

CVNov 18, 2024
Stacking Brick by Brick: Aligned Feature Isolation for Incremental Face Forgery Detection

Jikang Cheng, Zhiyuan Yan, Ying Zhang et al.

The rapid advancement of face forgery techniques has introduced a growing variety of forgeries. Incremental Face Forgery Detection (IFFD), involving gradually adding new forgery data to fine-tune the previously trained model, has been introduced as a promising strategy to deal with evolving forgery methods. However, a naively trained IFFD model is prone to catastrophic forgetting when new forgeries are integrated, as treating all forgeries as a single ''Fake" class in the Real/Fake classification can cause different forgery types overriding one another, thereby resulting in the forgetting of unique characteristics from earlier tasks and limiting the model's effectiveness in learning forgery specificity and generality. In this paper, we propose to stack the latent feature distributions of previous and new tasks brick by brick, $\textit{i.e.}$, achieving $\textbf{aligned feature isolation}$. In this manner, we aim to preserve learned forgery information and accumulate new knowledge by minimizing distribution overriding, thereby mitigating catastrophic forgetting. To achieve this, we first introduce Sparse Uniform Replay (SUR) to obtain the representative subsets that could be treated as the uniformly sparse versions of the previous global distributions. We then propose a Latent-space Incremental Detector (LID) that leverages SUR data to isolate and align distributions. For evaluation, we construct a more advanced and comprehensive benchmark tailored for IFFD. The leading experimental results validate the superiority of our method.

CVNov 20, 2023
NePF: Neural Photon Field for Single-Stage Inverse Rendering

Tuen-Yue Tsui, Qin Zou

We present a novel single-stage framework, Neural Photon Field (NePF), to address the ill-posed inverse rendering from multi-view images. Contrary to previous methods that recover the geometry, material, and illumination in multiple stages and extract the properties from various multi-layer perceptrons across different neural fields, we question such complexities and introduce our method - a single-stage framework that uniformly recovers all properties. NePF achieves this unification by fully utilizing the physical implication behind the weight function of neural implicit surfaces and the view-dependent radiance. Moreover, we introduce an innovative coordinate-based illumination model for rapid volume physically-based rendering. To regularize this illumination, we implement the subsurface scattering model for diffuse estimation. We evaluate our method on both real and synthetic datasets. The results demonstrate the superiority of our approach in recovering high-fidelity geometry and visual-plausible material attributes.

LGApr 23, 2025
PIN-WM: Learning Physics-INformed World Models for Non-Prehensile Manipulation

Wenxuan Li, Hang Zhao, Zhiyuan Yu et al.

While non-prehensile manipulation (e.g., controlled pushing/poking) constitutes a foundational robotic skill, its learning remains challenging due to the high sensitivity to complex physical interactions involving friction and restitution. To achieve robust policy learning and generalization, we opt to learn a world model of the 3D rigid body dynamics involved in non-prehensile manipulations and use it for model-based reinforcement learning. We propose PIN-WM, a Physics-INformed World Model that enables efficient end-to-end identification of a 3D rigid body dynamical system from visual observations. Adopting differentiable physics simulation, PIN-WM can be learned with only few-shot and task-agnostic physical interaction trajectories. Further, PIN-WM is learned with observational loss induced by Gaussian Splatting without needing state estimation. To bridge Sim2Real gaps, we turn the learned PIN-WM into a group of Digital Cousins via physics-aware randomizations which perturb physics and rendering parameters to generate diverse and meaningful variations of the PIN-WM. Extensive evaluations on both simulation and real-world tests demonstrate that PIN-WM, enhanced with physics-aware digital cousins, facilitates learning robust non-prehensile manipulation skills with Sim2Real transfer, surpassing the Real2Sim2Real state-of-the-arts.

LGMay 28, 2025
Taming Transformer Without Using Learning Rate Warmup

Xianbiao Qi, Yelin He, Jiaquan Ye et al.

Scaling Transformer to a large scale without using some technical tricks such as learning rate warump and using an obviously lower learning rate is an extremely challenging task, and is increasingly gaining more attention. In this paper, we provide a theoretical analysis for the process of training Transformer and reveal the rationale behind the model crash phenomenon in the training process, termed \textit{spectral energy concentration} of ${\bW_q}^{\top} \bW_k$, which is the reason for a malignant entropy collapse, where ${\bW_q}$ and $\bW_k$ are the projection matrices for the query and the key in Transformer, respectively. To remedy this problem, motivated by \textit{Weyl's Inequality}, we present a novel optimization strategy, \ie, making the weight updating in successive steps smooth -- if the ratio $\frac{σ_{1}(\nabla \bW_t)}{σ_{1}(\bW_{t-1})}$ is larger than a threshold, we will automatically bound the learning rate to a weighted multiple of $\frac{σ_{1}(\bW_{t-1})}{σ_{1}(\nabla \bW_t)}$, where $\nabla \bW_t$ is the updating quantity in step $t$. Such an optimization strategy can prevent spectral energy concentration to only a few directions, and thus can avoid malignant entropy collapse which will trigger the model crash. We conduct extensive experiments using ViT, Swin-Transformer and GPT, showing that our optimization strategy can effectively and stably train these Transformers without using learning rate warmup.

CVSep 16, 2025
Double Helix Diffusion for Cross-Domain Anomaly Image Generation

Linchun Wu, Qin Zou, Xianbiao Qi et al.

Visual anomaly inspection is critical in manufacturing, yet hampered by the scarcity of real anomaly samples for training robust detectors. Synthetic data generation presents a viable strategy for data augmentation; however, current methods remain constrained by two principal limitations: 1) the generation of anomalies that are structurally inconsistent with the normal background, and 2) the presence of undesirable feature entanglement between synthesized images and their corresponding annotation masks, which undermines the perceptual realism of the output. This paper introduces Double Helix Diffusion (DH-Diff), a novel cross-domain generative framework designed to simultaneously synthesize high-fidelity anomaly images and their pixel-level annotation masks, explicitly addressing these challenges. DH-Diff employs a unique architecture inspired by a double helix, cycling through distinct modules for feature separation, connection, and merging. Specifically, a domain-decoupled attention mechanism mitigates feature entanglement by enhancing image and annotation features independently, and meanwhile a semantic score map alignment module ensures structural authenticity by coherently integrating anomaly foregrounds. DH-Diff offers flexible control via text prompts and optional graphical guidance. Extensive experiments demonstrate that DH-Diff significantly outperforms state-of-the-art methods in diversity and authenticity, leading to significant improvements in downstream anomaly detection performance.

CVAug 19, 2025
ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving

Xianda Guo, Ruijun Zhang, Yiqun Duan et al.

Depth estimation is a fundamental task for 3D scene understanding in autonomous driving, robotics, and augmented reality. Existing depth datasets, such as KITTI, nuScenes, and DDAD, have advanced the field but suffer from limitations in diversity and scalability. As benchmark performance on these datasets approaches saturation, there is an increasing need for a new generation of large-scale, diverse, and cost-efficient datasets to support the era of foundation models and multi-modal learning. We present ROVR, a large-scale, diverse, and cost-efficient depth dataset designed to capture the complexity of real-world driving. ROVR comprises 200K high-resolution frames across highway, rural, and urban scenarios, spanning day/night and adverse weather conditions. A lightweight acquisition pipeline ensures scalable collection, while sparse but statistically sufficient ground truth supports robust training. Benchmarking with state-of-the-art monocular depth models reveals severe cross-dataset generalization failures: models achieving near-ceiling accuracy on KITTI degrade drastically on ROVR, and even when trained on ROVR, current methods fall short of saturation. These results highlight the unique challenges posed by ROVR-scene diversity, dynamic environments, and sparse ground truth, establishing it as a demanding new platform for advancing depth estimation and building models with stronger real-world robustness. Extensive ablation studies provide a more intuitive understanding of our dataset across different scenarios, lighting conditions, and generalized ability.

CVMay 21, 2025
Improving the generalization of gait recognition with limited datasets

Qian Zhou, Xianda Guo, Jilong Wang et al.

Generalized gait recognition remains challenging due to significant domain shifts in viewpoints, appearances, and environments. Mixed-dataset training has recently become a practical route to improve cross-domain robustness, but it introduces underexplored issues: 1) inter-dataset supervision conflicts, which distract identity learning, and 2) redundant or noisy samples, which reduce data efficiency and may reinforce dataset-specific patterns. To address these challenges, we introduce a unified paradigm for cross-dataset gait learning that simultaneously improves motion-signal quality and supervision consistency. We first increase the reliability of training data by suppressing sequences dominated by redundant gait cycles or unstable silhouettes, guided by representation redundancy and prediction uncertainty. This refinement concentrates learning on informative gait dynamics when mixing heterogeneous datasets. In parallel, we stabilize supervision by disentangling metric learning across datasets, forming triplets within each source to prevent destructive cross-domain gradients while preserving transferable identity cues. These components act in synergy to stabilize optimization and strengthen generalization without modifying network architectures or requiring extra annotations. Experiments on CASIA-B, OU-MVLP, Gait3D, and GREW with both GaitBase and DeepGaitV2 backbones consistently show improved cross-domain performance without sacrificing in-domain accuracy. These results demonstrate that data selection and aligning supervision effectively enables scalable mixed-dataset gait learning.

CVApr 23, 2025
Visibility-Uncertainty-guided 3D Gaussian Inpainting via Scene Conceptional Learning

Mingxuan Cui, Qing Guo, Yuyi Wang et al.

3D Gaussian Splatting (3DGS) has emerged as a powerful and efficient 3D representation for novel view synthesis. This paper extends 3DGS capabilities to inpainting, where masked objects in a scene are replaced with new contents that blend seamlessly with the surroundings. Unlike 2D image inpainting, 3D Gaussian inpainting (3DGI) is challenging in effectively leveraging complementary visual and semantic cues from multiple input views, as occluded areas in one view may be visible in others. To address this, we propose a method that measures the visibility uncertainties of 3D points across different input views and uses them to guide 3DGI in utilizing complementary visual cues. We also employ uncertainties to learn a semantic concept of scene without the masked object and use a diffusion model to fill masked objects in input images based on the learned concept. Finally, we build a novel 3DGI framework, VISTA, by integrating VISibility-uncerTainty-guided 3DGI with scene conceptuAl learning. VISTA generates high-quality 3DGS models capable of synthesizing artifact-free and naturally inpainted novel views. Furthermore, our approach extends to handling dynamic distractors arising from temporal object changes, enhancing its versatility in diverse scene reconstruction scenarios. We demonstrate the superior performance of our method over state-of-the-art techniques using two challenging datasets: the SPIn-NeRF dataset, featuring 10 diverse static 3D inpainting scenes, and an underwater 3D inpainting dataset derived from UTB180, including fast-moving fish as inpainting targets.

CVMar 15, 2021
Metric Learning for Anti-Compression Facial Forgery Detection

Shenhao Cao, Qin Zou, Xiuqing Mao et al.

Detecting facial forgery images and videos is an increasingly important topic in multimedia forensics. As forgery images and videos are usually compressed into different formats such as JPEG and H264 when circulating on the Internet, existing forgery-detection methods trained on uncompressed data often suffer from significant performance degradation in identifying them. To solve this problem, we propose a novel anti-compression facial forgery detection framework, which learns a compression-insensitive embedding feature space utilizing both original and compressed forgeries. Specifically, our approach consists of three ideas: (i) extracting compression-insensitive features from both uncompressed and compressed forgeries using an adversarial learning strategy; (ii) learning a robust partition by constructing a metric loss that can reduce the distance of the paired original and compressed images in the embedding space; (iii) improving the accuracy of tampered localization with an attention-transfer module. Experimental results demonstrate that, the proposed method is highly effective in handling both compressed and uncompressed facial forgery images.

CVNov 17, 2019
Transductive Zero-Shot Hashing for Multilabel Image Retrieval

Qin Zou, Zheng Zhang, Ling Cao et al.

Hash coding has been widely used in approximate nearest neighbor search for large-scale image retrieval. Given semantic annotations such as class labels and pairwise similarities of the training data, hashing methods can learn and generate effective and compact binary codes. While some newly introduced images may contain undefined semantic labels, which we call unseen images, zeor-shot hashing techniques have been studied. However, existing zeor-shot hashing methods focus on the retrieval of single-label images, and cannot handle multi-label images. In this paper, for the first time, a novel transductive zero-shot hashing method is proposed for multi-label unseen image retrieval. In order to predict the labels of the unseen/target data, a visual-semantic bridge is built via instance-concept coherence ranking on the seen/source data. Then, pairwise similarity loss and focal quantization loss are constructed for training a hashing model using both the seen/source and unseen/target data. Extensive evaluations on three popular multi-label datasets demonstrate that, the proposed hashing method achieves significantly better results than the competing methods.

CVOct 25, 2019
An End-to-End Network for Co-Saliency Detection in One Single Image

Yuanhao Yue, Qin Zou, Hongkai Yu et al.

Co-saliency detection within a single image is a common vision problem that has received little attention and has not yet been well addressed. Existing methods often used a bottom-up strategy to infer co-saliency in an image in which salient regions are firstly detected using visual primitives such as color and shape and then grouped and merged into a co-saliency map. However, co-saliency is intrinsically perceived complexly with bottom-up and top-down strategies combined in human vision. To address this problem, this study proposes a novel end-to-end trainable network comprising a backbone net and two branch nets. The backbone net uses ground-truth masks as top-down guidance for saliency prediction, whereas the two branch nets construct triplet proposals for regional feature mapping and clustering, which drives the network to be bottom-up sensitive to co-salient regions. We construct a new dataset of 2,019 natural images with co-saliency in each image to evaluate the proposed method. Experimental results show that the proposed method achieves state-of-the-art accuracy with a running speed of 28 fps.

CVSep 9, 2019
Extreme Low Resolution Activity Recognition with Confident Spatial-Temporal Attention Transfer

Yucai Bai, Qin Zou, Xieyuanli Chen et al.

Activity recognition on extreme low-resolution videos, e.g., a resolution of 12*16 pixels, plays a vital role in far-view surveillance and privacy-preserving multimedia analysis. Low-resolution videos only contain limited information. Given the fact that one same activity may be represented by videos in both high resolution (HR) and extreme low resolution (eLR), it is worth studying to utilize the relevant HR data to improve the eLR activity recognition. In this work, we propose a novel Confident Spatial-Temporal Attention Transfer (CSTAT) for eLR activity recognition. CSTAT can acquire information from HR data by reducing the attention differences with a transfer-learning strategy. Besides, the credibility of the supervisory signal is also taken into consideration for a more confident transferring process. Experimental results on two well-known datasets, i.e., UCF101 and HMDB51, demonstrate that, the proposed method can effectively improve the accuracy of eLR activity recognition and achieve an accuracy of 59.23% on 12*16 videos in HMDB51, a state-of-the-art performance.

CVMar 6, 2019
Robust Lane Detection from Continuous Driving Scenes Using Deep Neural Networks

Qin Zou, Hanwen Jiang, Qiyu Dai et al.

Lane detection in driving scenes is an important module for autonomous vehicles and advanced driver assistance systems. In recent years, many sophisticated lane detection methods have been proposed. However, most methods focus on detecting the lane from one single image, and often lead to unsatisfactory performance in handling some extremely-bad situations such as heavy shadow, severe mark degradation, serious vehicle occlusion, and so on. In fact, lanes are continuous line structures on the road. Consequently, the lane that cannot be accurately detected in one current frame may potentially be inferred out by incorporating information of previous frames. To this end, we investigate lane detection by using multiple frames of a continuous driving scene, and propose a hybrid deep architecture by combining the convolutional neural network (CNN) and the recurrent neural network (RNN). Specifically, information of each frame is abstracted by a CNN block, and the CNN features of multiple continuous frames, holding the property of time-series, are then fed into the RNN block for feature learning and lane prediction. Extensive experiments on two large-scale datasets demonstrate that, the proposed method outperforms the competing methods in lane detection, especially in handling difficult situations.

CRDec 25, 2018
Privacy-Preserving Collaborative Deep Learning with Unreliable Participants

Lingchen Zhao, Qian Wang, Qin Zou et al.

With powerful parallel computing GPUs and massive user data, neural-network-based deep learning can well exert its strong power in problem modeling and solving, and has archived great success in many applications such as image classification, speech recognition and machine translation etc. While deep learning has been increasingly popular, the problem of privacy leakage becomes more and more urgent. Given the fact that the training data may contain highly sensitive information, e.g., personal medical records, directly sharing them among the users (i.e., participants) or centrally storing them in one single location may pose a considerable threat to user privacy. In this paper, we present a practical privacy-preserving collaborative deep learning system that allows users to cooperatively build a collective deep learning model with data of all participants, without direct data sharing and central data storage. In our system, each participant trains a local model with their own data and only shares model parameters with the others. To further avoid potential privacy leakage from sharing model parameters, we use functional mechanism to perturb the objective function of the neural network in the training process to achieve $ε$-differential privacy. In particular, for the first time, we consider the existence of~\textit{unreliable participants}, i.e., the participants with low-quality data, and propose a solution to reduce the impact of these participants while protecting their privacy. We evaluate the performance of our system on two well-known real-world datasets for regression and classification tasks. The results demonstrate that the proposed system is robust against unreliable participants, and achieves high accuracy close to the model trained in a traditional centralized manner while ensuring rigorous privacy protection.

LGNov 1, 2018
Deep Learning-Based Gait Recognition Using Smartphones in the Wild

Qin Zou, Yanling Wang, Qian Wang et al.

Compared to other biometrics, gait is difficult to conceal and has the advantage of being unobtrusive. Inertial sensors, such as accelerometers and gyroscopes, are often used to capture gait dynamics. These inertial sensors are commonly integrated into smartphones and are widely used by the average person, which makes gait data convenient and inexpensive to collect. In this paper, we study gait recognition using smartphones in the wild. In contrast to traditional methods, which often require a person to walk along a specified road and/or at a normal walking speed, the proposed method collects inertial gait data under unconstrained conditions without knowing when, where, and how the user walks. To obtain good person identification and authentication performance, deep-learning techniques are presented to learn and model the gait biometrics based on walking data. Specifically, a hybrid deep neural network is proposed for robust gait feature representation, where features in the space and time domains are successively abstracted by a convolutional neural network and a recurrent neural network. In the experiments, two datasets collected by smartphones for a total of 118 subjects are used for evaluations. The experiments show that the proposed method achieves higher than 93.5\% and 93.7\% accuracies in person identification and authentication, respectively.

CVOct 22, 2018
Dating Ancient Paintings of Mogao Grottoes Using Deeply Learnt Visual Codes

Qingquan Li, Qin Zou, De Ma et al.

Cultural heritage is the asset of all the peoples of the world. The preservation and inheritance of cultural heritage is conducive to the progress of human civilization. In northwestern China, there is a world heritage site -- Mogao Grottoes -- that has a plenty of mural paintings showing the historical cultures of ancient China. To study these historical cultures, one critical procedure is to date the mural paintings, i.e., determining the era when they were created. Until now, most mural paintings at Mogao Grottoes have been dated by directly referring to the mural texts or historical documents. However, some are still left with creation-era undetermined due to the lack of reference materials. Considering that the drawing style of mural paintings was changing along the history and the drawing style can be learned and quantified through painting data, we formulate the problem of mural-painting dating into a problem of drawing-style classification. In fact, drawing styles can be expressed not only in color or curvature, but also in some unknown forms -- the forms that have not been observed. To this end, besides sophisticated color and shape descriptors, a deep convolution neural network is designed to encode the implicit drawing styles. 3860 mural paintings collected from 194 different grottoes with determined creation-era labels are used to train the classification model and build the dating method. In experiments, the proposed dating method is applied to seven mural paintings which were previously dated with controversies, and the exciting new dating results are approved by the Dunhuang expert.

CVMar 9, 2018
Fusing Hierarchical Convolutional Features for Human Body Segmentation and Clothing Fashion Classification

Zheng Zhang, Chengfang Song, Qin Zou

The clothing fashion reflects the common aesthetics that people share with each other in dressing. To recognize the fashion time of a clothing is meaningful for both an individual and the industry. In this paper, under the assumption that the clothing fashion changes year by year, the fashion-time recognition problem is mapped into a clothing-fashion classification problem. Specifically, a novel deep neural network is proposed which achieves accurate human body segmentation by fusing multi-scale convolutional features in a fully convolutional network, and then feature learning and fashion classification are performed on the segmented parts avoiding the influence of image background. In the experiments, 9,339 fashion images from 8 continuous years are collected for performance evaluation. The results demonstrate the effectiveness of the proposed body segmentation and fashion classification methods.

CVMar 8, 2018
Improved Deep Hashing with Soft Pairwise Similarity for Multi-label Image Retrieval

Zheng Zhang, Qin Zou, Yuewei Lin et al.

Hash coding has been widely used in the approximate nearest neighbor search for large-scale image retrieval. Recently, many deep hashing methods have been proposed and shown largely improved performance over traditional feature-learning-based methods. Most of these methods examine the pairwise similarity on the semantic-level labels, where the pairwise similarity is generally defined in a hard-assignment way. That is, the pairwise similarity is '1' if they share no less than one class label and '0' if they do not share any. However, such similarity definition cannot reflect the similarity ranking for pairwise images that hold multiple labels. In this paper, a new deep hashing method is proposed for multi-label image retrieval by re-defining the pairwise similarity into an instance similarity, where the instance similarity is quantified into a percentage based on the normalized semantic labels. Based on the instance similarity, a weighted cross-entropy loss and a minimum mean square error loss are tailored for loss-function construction, and are efficiently used for simultaneous feature learning and hash coding. Experiments on three popular datasets demonstrate that, the proposed method outperforms the competing methods and achieves the state-of-the-art performance in multi-label image retrieval.

CVOct 31, 2016
Robust Gait Recognition by Integrating Inertial and RGBD Sensors

Qin Zou, Lihao Ni, Qian Wang et al.

Gait has been considered as a promising and unique biometric for person identification. Traditionally, gait data are collected using either color sensors, such as a CCD camera, depth sensors, such as a Microsoft Kinect, or inertial sensors, such as an accelerometer. However, a single type of sensors may only capture part of the dynamic gait features and make the gait recognition sensitive to complex covariate conditions, leading to fragile gait-based person identification systems. In this paper, we propose to combine all three types of sensors for gait data collection and gait recognition, which can be used for important identification applications, such as identity recognition to access a restricted building or area. We propose two new algorithms, namely EigenGait and TrajGait, to extract gait features from the inertial data and the RGBD (color and depth) data, respectively. Specifically, EigenGait extracts general gait dynamics from the accelerometer readings in the eigenspace and TrajGait extracts more detailed sub-dynamics by analyzing 3D dense trajectories. Finally, both extracted features are fed into a supervised classifier for gait recognition and person identification. Experiments on 50 subjects, with comparisons to several other state-of-the-art gait-recognition approaches, show that the proposed approach can achieve higher recognition accuracy and robustness.

CVAug 26, 2016
Who Leads the Clothing Fashion: Style, Color, or Texture? A Computational Study

Qin Zou, Zheng Zhang, Qian Wang et al.

It is well known that clothing fashion is a distinctive and often habitual trend in the style in which a person dresses. Clothing fashions are usually expressed with visual stimuli such as style, color, and texture. However, it is not clear which visual stimulus places higher/lower influence on the updating of clothing fashion. In this study, computer vision and machine learning techniques are employed to analyze the influence of different visual stimuli on clothing-fashion updates. Specifically, a classification-based model is proposed to quantify the influence of different visual stimuli, in which each visual stimulus's influence is quantified by its corresponding accuracy in fashion classification. Experimental results demonstrate that, on clothing-fashion updates, the style holds a higher influence than the color, and the color holds a higher influence than the texture.

IRJun 2, 2013
CRUC: Cold-start Recommendations Using Collaborative Filtering in Internet of Things

Daqiang Zhang, Qin Zou, Haoyi Xiong

The Internet of Things (IoT) aims at interconnecting everyday objects (including both things and users) and then using this connection information to provide customized user services. However, IoT does not work in its initial stages without adequate acquisition of user preferences. This is caused by cold-start problem that is a situation where only few users are interconnected. To this end, we propose CRUC scheme - Cold-start Recommendations Using Collaborative Filtering in IoT, involving formulation, filtering and prediction steps. Extensive experiments over real cases and simulation have been performed to evaluate the performance of CRUC scheme. Experimental results show that CRUC efficiently solves the cold-start problem in IoT.