Fang Zhao

CV
h-index18
41papers
1,892citations
Novelty55%
AI Score60

41 Papers

CVMar 15, 2023Code
Skinned Motion Retargeting with Residual Perception of Motion Semantics & Geometry

Jiaxu Zhang, Junwu Weng, Di Kang et al.

A good motion retargeting cannot be reached without reasonable consideration of source-target differences on both the skeleton and shape geometry levels. In this work, we propose a novel Residual RETargeting network (R2ET) structure, which relies on two neural modification modules, to adjust the source motions to fit the target skeletons and shapes progressively. In particular, a skeleton-aware module is introduced to preserve the source motion semantics. A shape-aware module is designed to perceive the geometries of target characters to reduce interpenetration and contact-missing. Driven by our explored distance-based losses that explicitly model the motion semantics and geometry, these two modules can learn residual motion modifications on the source motion to generate plausible retargeted motion in a single inference without post-processing. To balance these two modifications, we further present a balancing gate to conduct linear interpolation between them. Extensive experiments on the public dataset Mixamo demonstrate that our R2ET achieves the state-of-the-art performance, and provides a good balance between the preservation of motion semantics as well as the attenuation of interpenetration and contact-missing. Code is available at https://github.com/Kebii/R2ET.

CVMar 17, 2022
Regional Semantic Contrast and Aggregation for Weakly Supervised Semantic Segmentation

Tianfei Zhou, Meijie Zhang, Fang Zhao et al.

Learning semantic segmentation from weakly-labeled (e.g., image tags only) data is challenging since it is hard to infer dense object regions from sparse semantic tags. Despite being broadly studied, most current efforts directly learn from limited semantic annotations carried by individual image or image pairs, and struggle to obtain integral localization maps. Our work alleviates this from a novel perspective, by exploring rich semantic contexts synergistically among abundant weakly-labeled training data for network learning and inference. In particular, we propose regional semantic contrast and aggregation (RCA) . RCA is equipped with a regional memory bank to store massive, diverse object patterns appearing in training data, which acts as strong support for exploration of dataset-level semantic structure. Particularly, we propose i) semantic contrast to drive network learning by contrasting massive categorical object regions, leading to a more holistic object pattern understanding, and ii) semantic aggregation to gather diverse relational contexts in the memory to enrich semantic representations. In this manner, RCA earns a strong capability of fine-grained semantic understanding, and eventually establishes new state-of-the-art results on two popular benchmarks, i.e., PASCAL VOC 2012 and COCO 2014.

CVMar 20, 2023
HMC: Hierarchical Mesh Coarsening for Skeleton-free Motion Retargeting

Haoyu Wang, Shaoli Huang, Fang Zhao et al.

We present a simple yet effective method for skeleton-free motion retargeting. Previous methods transfer motion between high-resolution meshes, failing to preserve the inherent local-part motions in the mesh. Addressing this issue, our proposed method learns the correspondence in a coarse-to-fine fashion by integrating the retargeting process with a mesh-coarsening pipeline. First, we propose a mesh-coarsening module that coarsens the mesh representations for better motion transfer. This module improves the ability to handle small-part motion and preserves the local motion interdependence between neighboring mesh vertices. Furthermore, we leverage a hierarchical refinement procedure to complement missing mesh details by gradually improving the low-resolution mesh output with a higher-resolution one. We evaluate our method on several well-known 3D character datasets, and it yields an average improvement of 25% on point-wise mesh euclidean distance (PMD) against the start-of-art method. Moreover, our qualitative results show that our method is significantly helpful in preserving the moving consistency of different body parts on the target character due to disentangling body-part structures and mesh details in a hierarchical way.

CVJul 27, 2023
The RoboDepth Challenge: Methods and Advancements Towards Robust Depth Estimation

Lingdong Kong, Yaru Niu, Shaoyuan Xie et al.

Accurate depth estimation under out-of-distribution (OoD) scenarios, such as adverse weather conditions, sensor failure, and noise contamination, is desirable for safety-critical applications. Existing depth estimation systems, however, suffer inevitably from real-world corruptions and perturbations and are struggled to provide reliable depth predictions under such cases. In this paper, we summarize the winning solutions from the RoboDepth Challenge -- an academic competition designed to facilitate and advance robust OoD depth estimation. This challenge was developed based on the newly established KITTI-C and NYUDepth2-C benchmarks. We hosted two stand-alone tracks, with an emphasis on robust self-supervised and robust fully-supervised depth estimation, respectively. Out of more than two hundred participants, nine unique and top-performing solutions have appeared, with novel designs ranging from the following aspects: spatial- and frequency-domain augmentations, masked image modeling, image restoration and super-resolution, adversarial training, diffusion-based noise suppression, vision-language pre-training, learned model ensembling, and hierarchical feature enhancement. Extensive experimental analyses along with insightful observations are drawn to better understand the rationale behind each design. We hope this challenge could lay a solid foundation for future research on robust and reliable depth estimation and beyond. The datasets, competition toolkit, workshop recordings, and source code from the winning teams are publicly available on the challenge website.

CVApr 3, 2023
Learning Anchor Transformations for 3D Garment Animation

Fang Zhao, Zekun Li, Shaoli Huang et al.

This paper proposes an anchor-based deformation model, namely AnchorDEF, to predict 3D garment animation from a body motion sequence. It deforms a garment mesh template by a mixture of rigid transformations with extra nonlinear displacements. A set of anchors around the mesh surface is introduced to guide the learning of rigid transformation matrices. Once the anchor transformations are found, per-vertex nonlinear displacements of the garment template can be regressed in a canonical space, which reduces the complexity of deformation space learning. By explicitly constraining the transformed anchors to satisfy the consistencies of position, normal and direction, the physical meaning of learned anchor transformations in space is guaranteed for better generalization. Furthermore, an adaptive anchor updating is proposed to optimize the anchor position by being aware of local mesh topology for learning representative anchor transformations. Qualitative and quantitative experiments on different types of garments demonstrate that AnchorDEF achieves the state-of-the-art performance on 3D garment deformation prediction in motion, especially for loose-fitting garments.

ROSep 15, 2023
OccupancyDETR: Using DETR for Mixed Dense-sparse 3D Occupancy Prediction

Yupeng Jia, Jie He, Runze Chen et al.

Visual-based 3D semantic occupancy perception is a key technology for robotics, including autonomous vehicles, offering an enhanced understanding of the environment by 3D. This approach, however, typically requires more computational resources than BEV or 2D methods. We propose a novel 3D semantic occupancy perception method, OccupancyDETR, which utilizes a DETR-like object detection, a mixed dense-sparse 3D occupancy decoder. Our approach distinguishes between foreground and background within a scene. Initially, foreground objects are detected using the DETR-like object detection. Subsequently, queries for both foreground and background objects are fed into the mixed dense-sparse 3D occupancy decoder, performing upsampling in dense and sparse methods, respectively. Finally, a MaskFormer is utilized to infer the semantics of the background voxels. Our approach strikes a balance between efficiency and accuracy, achieving faster inference times, lower resource consumption, and improved performance for small object detection. We demonstrate the effectiveness of our proposed method on the SemanticKITTI dataset, showcasing an mIoU of 14 and a processing speed of 10 FPS, thereby presenting a promising solution for real-time 3D semantic occupancy perception.

CVAug 23, 2024
Map-Free Visual Relocalization Enhanced by Instance Knowledge and Depth Knowledge

Mingyu Xiao, Runze Chen, Haiyong Luo et al.

Map-free relocalization technology is crucial for applications in autonomous navigation and augmented reality, but relying on pre-built maps is often impractical. It faces significant challenges due to limitations in matching methods and the inherent lack of scale in monocular images. These issues lead to substantial rotational and metric errors and even localization failures in real-world scenarios. Large matching errors significantly impact the overall relocalization process, affecting both rotational and translational accuracy. Due to the inherent limitations of the camera itself, recovering the metric scale from a single image is crucial, as this significantly impacts the translation error. To address these challenges, we propose a map-free relocalization method enhanced by instance knowledge and depth knowledge. By leveraging instance-based matching information to improve global matching results, our method significantly reduces the possibility of mismatching across different objects. The robustness of instance knowledge across the scene helps the feature point matching model focus on relevant regions and enhance matching accuracy. Additionally, we use estimated metric depth from a single image to reduce metric errors and improve scale recovery accuracy. By integrating methods dedicated to mitigating large translational and rotational errors, our approach demonstrates superior performance in map-free relocalization techniques.

CVNov 10, 2025Code
HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment

Ruijia Wu, Ping Chen, Fei Shen et al.

Contrastive vision-language models like CLIP have achieved impressive results in image-text retrieval by aligning image and text representations in a shared embedding space. However, these models often treat text as flat sequences, limiting their ability to handle complex, compositional, and long-form descriptions. In particular, they fail to capture two essential properties of language: semantic hierarchy, which reflects the multi-level compositional structure of text, and semantic monotonicity, where richer descriptions should result in stronger alignment with visual content.To address these limitations, we propose HiMo-CLIP, a representation-level framework that enhances CLIP-style models without modifying the encoder architecture. HiMo-CLIP introduces two key components: a hierarchical decomposition (HiDe) module that extracts latent semantic components from long-form text via in-batch PCA, enabling flexible, batch-aware alignment across different semantic granularities, and a monotonicity-aware contrastive loss (MoLo) that jointly aligns global and component-level representations, encouraging the model to internalize semantic ordering and alignment strength as a function of textual completeness.These components work in concert to produce structured, cognitively-aligned cross-modal representations. Experiments on multiple image-text retrieval benchmarks show that HiMo-CLIP consistently outperforms strong baselines, particularly under long or compositional descriptions. The code is available at https://github.com/UnicomAI/HiMo-CLIP.

CVOct 30, 2025Code
LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation

Huanlin Gao, Ping Chen, Fuyuan Shi et al.

We present LeMiCa, a training-free and efficient acceleration framework for diffusion-based video generation. While existing caching strategies primarily focus on reducing local heuristic errors, they often overlook the accumulation of global errors, leading to noticeable content degradation between accelerated and original videos. To address this issue, we formulate cache scheduling as a directed graph with error-weighted edges and introduce a Lexicographic Minimax Path Optimization strategy that explicitly bounds the worst-case path error. This approach substantially improves the consistency of global content and style across generated frames. Extensive experiments on multiple text-to-video benchmarks demonstrate that LeMiCa delivers dual improvements in both inference speed and generation quality. Notably, our method achieves a 2.9x speedup on the Latte model and reaches an LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques. Importantly, these gains come with minimal perceptual quality degradation, making LeMiCa a robust and generalizable paradigm for accelerating diffusion-based video generation. We believe this approach can serve as a strong foundation for future research on efficient and reliable video synthesis. Our code is available at :https://github.com/UnicomAI/LeMiCa

CVSep 13, 2024
CSS: Overcoming Pose and Scene Challenges in Crowd-Sourced 3D Gaussian Splatting

Runze Chen, Mingyu Xiao, Haiyong Luo et al.

We introduce Crowd-Sourced Splatting (CSS), a novel 3D Gaussian Splatting (3DGS) pipeline designed to overcome the challenges of pose-free scene reconstruction using crowd-sourced imagery. The dream of reconstructing historically significant but inaccessible scenes from collections of photographs has long captivated researchers. However, traditional 3D techniques struggle with missing camera poses, limited viewpoints, and inconsistent lighting. CSS addresses these challenges through robust geometric priors and advanced illumination modeling, enabling high-quality novel view synthesis under complex, real-world conditions. Our method demonstrates clear improvements over existing approaches, paving the way for more accurate and flexible applications in AR, VR, and large-scale 3D reconstruction.

CVDec 28, 2025
OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding

Wenyuan Huang, Zhao Wang, Zhou Wei et al.

3D visual grounding aims to locate objects based on natural language descriptions in 3D scenes. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Visual Language Models (VLMs) for reasoning about object locations, which limits the applications in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which is designed to overcome the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to function with both pre-defined and open-world categories. We also propose a new dataset named OpenTarget, which contains over 7000 object-description pairs to evaluate our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers a substantial 17.6% improvement on OpenTarget. Project Page at https://why-102.github.io/openground.io/.

CVDec 23, 2024Code
Kernel-Aware Graph Prompt Learning for Few-Shot Anomaly Detection

Fenfang Tao, Guo-Sen Xie, Fang Zhao et al.

Few-shot anomaly detection (FSAD) aims to detect unseen anomaly regions with the guidance of very few normal support images from the same class. Existing FSAD methods usually find anomalies by directly designing complex text prompts to align them with visual features under the prevailing large vision-language model paradigm. However, these methods, almost always, neglect intrinsic contextual information in visual features, e.g., the interaction relationships between different vision layers, which is an important clue for detecting anomalies comprehensively. To this end, we propose a kernel-aware graph prompt learning framework, termed as KAG-prompt, by reasoning the cross-layer relations among visual features for FSAD. Specifically, a kernel-aware hierarchical graph is built by taking the different layer features focusing on anomalous regions of different sizes as nodes, meanwhile, the relationships between arbitrary pairs of nodes stand for the edges of the graph. By message passing over this graph, KAG-prompt can capture cross-layer contextual information, thus leading to more accurate anomaly prediction. Moreover, to integrate the information of multiple important anomaly signals in the prediction map, we propose a novel image-level scoring method based on multi-level information fusion. Extensive experiments on MVTecAD and VisA datasets show that KAG-prompt achieves state-of-the-art FSAD results for image-level/pixel-level anomaly detection. Code is available at https://github.com/CVL-hub/KAG-prompt.git.

CVDec 23, 2024Code
AFANet: Adaptive Frequency-Aware Network for Weakly-Supervised Few-Shot Semantic Segmentation

Jiaqi Ma, Guo-Sen Xie, Fang Zhao et al.

Few-shot learning aims to recognize novel concepts by leveraging prior knowledge learned from a few samples. However, for visually intensive tasks such as few-shot semantic segmentation, pixel-level annotations are time-consuming and costly. Therefore, in this paper, we utilize the more challenging image-level annotations and propose an adaptive frequency-aware network (AFANet) for weakly-supervised few-shot semantic segmentation (WFSS). Specifically, we first propose a cross-granularity frequency-aware module (CFM) that decouples RGB images into high-frequency and low-frequency distributions and further optimizes semantic structural information by realigning them. Unlike most existing WFSS methods using the textual information from the multi-modal language-vision model, e.g., CLIP, in an offline learning manner, we further propose a CLIP-guided spatial-adapter module (CSM), which performs spatial domain adaptive transformation on textual information through online learning, thus providing enriched cross-modal semantic information for CFM. Extensive experiments on the Pascal-5\textsuperscript{i} and COCO-20\textsuperscript{i} datasets demonstrate that AFANet has achieved state-of-the-art performance. The code is available at https://github.com/jarch-ma/AFANet.

64.6AIMay 14
MediaClaw: Multimodal Intelligent-Agent Platform Technical Report

Shaoan Zhao, Huanlin Gao, Qiang Hui et al.

MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. Its core design follows a three-layer architecture of unified abstraction, pluginized extension, and workflow orchestration. The system is intended to address practical deployment pain points in AIGC adoption, including fragmented capabilities, heterogeneous interfaces, disconnected production processes, and limited reuse of high-quality production workflows. \system{} abstracts full-category AIGC capabilities into a unified invocation model, uses plugins to support hot-pluggable capability expansion, and uses task-oriented Skills to turn complex production processes into reusable workflow assets. This report focuses on the architectural design philosophy of MediaClaw, the design logic of its core capability model, and the key engineering trade-offs in implementation. It aims to provide reusable practical reference for building multimodal capability platforms.

CVAug 22, 2025Code
Through the Looking Glass: A Dual Perspective on Weakly-Supervised Few-Shot Segmentation

Jiaqi Ma, Guo-Sen Xie, Fang Zhao et al.

Meta-learning aims to uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design results in over-semantic homogenization. To address this, we propose a novel homologous but heterogeneous network. By treating support-query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly-supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2\% improvement on Pascal-5\textsuperscript{i} and a 9.7\% improvement on COCO-20\textsuperscript{i}. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model that outperforms fully supervised (pixel-level) models under the same backbone architectures. The code is available at https://github.com/jarch-ma/TLG.

ROJun 12, 2021Code
Lvio-Fusion: A Self-adaptive Multi-sensor Fusion SLAM Framework Using Actor-critic Method

Yupeng Jia, Haiyong Luo, Fang Zhao et al.

State estimation with sensors is essential for mobile robots. Due to different performance of sensors in different environments, how to fuse measurements of various sensors is a problem. In this paper, we propose a tightly coupled multi-sensor fusion framework, Lvio-Fusion, which fuses stereo camera, Lidar, IMU, and GPS based on the graph optimization. Especially for urban traffic scenes, we introduce a segmented global pose graph optimization with GPS and loop-closure, which can eliminate accumulated drifts. Additionally, we creatively use a actor-critic method in reinforcement learning to adaptively adjust sensors' weight. After training, actor-critic agent can provide the system better and dynamic sensors' weight. We evaluate the performance of our system on public datasets and compare it with other state-of-the-art methods, which shows that the proposed method achieves high estimation accuracy and robustness to various environments. And our implementations are open source and highly scalable.

CVJun 11, 2020Code
Attentive WaveBlock: Complementarity-enhanced Mutual Networks for Unsupervised Domain Adaptation in Person Re-identification and Beyond

Wenhao Wang, Fang Zhao, Shengcai Liao et al.

Unsupervised domain adaptation (UDA) for person re-identification is challenging because of the huge gap between the source and target domain. A typical self-training method is to use pseudo-labels generated by clustering algorithms to iteratively optimize the model on the target domain. However, a drawback to this is that noisy pseudo-labels generally cause trouble in learning. To address this problem, a mutual learning method by dual networks has been developed to produce reliable soft labels. However, as the two neural networks gradually converge, their complementarity is weakened and they likely become biased towards the same kind of noise. This paper proposes a novel light-weight module, the Attentive WaveBlock (AWB), which can be integrated into the dual networks of mutual learning to enhance the complementarity and further depress noise in the pseudo-labels. Specifically, we first introduce a parameter-free module, the WaveBlock, which creates a difference between features learned by two networks by waving blocks of feature maps differently. Then, an attention mechanism is leveraged to enlarge the difference created and discover more complementary features. Furthermore, two kinds of combination strategies, i.e. pre-attention and post-attention, are explored. Experiments demonstrate that the proposed method achieves state-of-the-art performance with significant improvements on multiple UDA person re-identification tasks. We also prove the generality of the proposed method by applying it to vehicle re-identification and image classification tasks. Our codes and models are available at https://github.com/WangWenhao0716/Attentive-WaveBlock.

70.6CVMar 28
MotionRFT: Unified Reinforcement Fine-Tuning for Text-to-Motion Generation

Xiaofeng Tan, Wanjiang Weng, Hongsong Wang et al.

Text-to-motion generation has advanced with diffusion- and flow-based generative models, yet supervised pretraining remains insufficient to align models with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods have key limitations: they (1) target a specific motion representation, such as joints, (2) optimize a particular aspect, such as text-motion alignment, and may compromise other factors; and (3) incur substantial computational overhead, data dependence, and coarse-grained optimization. We present a reinforcement fine-tuning framework that comprises a heterogeneous-representation, multi-dimensional reward model, MotionReward, and an efficient, fine-grained fine-tuning method, EasyTune. To obtain a unified semantics representation, MotionReward maps heterogeneous motions into a shared semantic space anchored by text, enabling multidimensional reward learning; Self-refinement Preference Learning further enhances semantics without additional annotations. For efficient and effective fine-tuning, we identify the recursive gradient dependence across denoising steps as the key bottleneck, and propose EasyTune, which optimizes step-wise rather than over the full trajectory, yielding dense, fine-grained, and memory-efficient updates. Extensive experiments validate the effectiveness of our framework, achieving FID 0.132 at 22.10 GB peak memory for MLD model and saving up to 15.22 GB over DRaFT. It reduces FID by 22.9% on joint-based ACMDM, and achieves a 12.6% R-Precision gain and 23.3% FID improvement on rotation-based HY Motion. Our project page with code is publicly available.

55.2CVMar 18
One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

Haoxiang Rao, Zhao Wang, Chenyang Si et al.

Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.

CVApr 17, 2025
LAD-Reasoner: Tiny Multimodal Models are Good Reasoners for Logical Anomaly Detection

Weijia Li, Guanglei Chu, Jiong Chen et al.

Recent advances in industrial anomaly detection have highlighted the need for deeper logical anomaly analysis, where unexpected relationships among objects, counts, and spatial configurations must be identified and explained. Existing approaches often rely on large-scale external reasoning modules or elaborate pipeline designs, hindering practical deployment and interpretability. To address these limitations, we introduce a new task, Reasoning Logical Anomaly Detection (RLAD), which extends traditional anomaly detection by incorporating logical reasoning. We propose a new framework, LAD-Reasoner, a customized tiny multimodal language model built on Qwen2.5-VL 3B. Our approach leverages a two-stage training paradigm that first employs Supervised Fine-Tuning (SFT) for fine-grained visual understanding, followed by Group Relative Policy Optimization (GRPO) to refine logical anomaly detection and enforce coherent, human-readable reasoning. Crucially, reward signals are derived from both the detection accuracy and the structural quality of the outputs, obviating the need for building chain of thought (CoT) reasoning data. Experiments on the MVTec LOCO AD dataset show that LAD-Reasoner, though significantly smaller, matches the performance of Qwen2.5-VL-72B in accuracy and F1 score, and further excels in producing concise and interpretable rationales. This unified design reduces reliance on large models and complex pipelines, while offering transparent and interpretable insights into logical anomaly detection. Code and data will be released.

LGJan 27
MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

Huanlin Gao, Ping Chen, Fuyuan Shi et al.

We present MeanCache, a training-free caching framework for efficient Flow Matching inference. Existing caching methods reduce redundant computation but typically rely on instantaneous velocity information (e.g., feature caching), which often leads to severe trajectory deviations and error accumulation under high acceleration ratios. MeanCache introduces an average-velocity perspective: by leveraging cached Jacobian--vector products (JVP) to construct interval average velocities from instantaneous velocities, it effectively mitigates local error accumulation. To further improve cache timing and JVP reuse stability, we develop a trajectory-stability scheduling strategy as a practical tool, employing a Peak-Suppressed Shortest Path under budget constraints to determine the schedule. Experiments on FLUX.1, Qwen-Image, and HunyuanVideo demonstrate that MeanCache achieves 4.12X and 4.56X and 3.59X acceleration, respectively, while consistently outperforming state-of-the-art caching baselines in generation quality. We believe this simple yet effective approach provides a new perspective for Flow Matching inference and will inspire further exploration of stability-driven acceleration in commercial-scale generative models.

CVAug 18, 2025
Foundation Model for Skeleton-Based Human Action Understanding

Hongsong Wang, Wanjiang Weng, Junbo Wang et al.

Human action understanding serves as a foundational pillar in the field of intelligent motion perception. Skeletons serve as a modality- and device-agnostic representation for human modeling, and skeleton-based action understanding has potential applications in humanoid robot control and interaction. \RED{However, existing works often lack the scalability and generalization required to handle diverse action understanding tasks. There is no skeleton foundation model that can be adapted to a wide range of action understanding tasks}. This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework, which serves as a foundational model for skeleton-based human action understanding. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT). The DSTE module adopts two parallel streams to learn temporal dynamic and spatial structure features. The MG-FD module collaboratively performs feature decorrelation across temporal, spatial, and instance domains to reduce dimensional redundancy and enhance information extraction. The MPCT module employs both multi-view and multi-modal self-supervised consistency training. The former enhances the learning of high-level semantics and mitigates the impact of low-level discrepancies, while the latter effectively facilitates the learning of informative multimodal features. We perform extensive experiments on 25 benchmarks across across 9 skeleton-based action understanding tasks, covering coarse prediction, dense prediction, and transferred prediction. Our approach significantly outperforms the current state-of-the-art methods. We hope that this work would broaden the scope of research in skeleton-based action understanding and encourage more attention to dense prediction tasks.

NCFeb 24, 2025
Deep Learning-Powered Electrical Brain Signals Analysis: Advancing Neurological Diagnostics

Jiahe Li, Xin Chen, Fanqi Shen et al.

Neurological disorders pose major global health challenges, driving advances in brain signal analysis. Scalp electroencephalography (EEG) and intracranial EEG (iEEG) are widely used for diagnosis and monitoring. However, dataset heterogeneity and task variations hinder the development of robust deep learning solutions. This review systematically examines recent advances in deep learning approaches for EEG/iEEG-based neurological diagnostics, focusing on applications across 7 neurological conditions using 46 datasets. For each condition, we review representative methods and their quantitative results, integrating performance comparisons with analyses of data usage, model design, and task-specific adaptations, while highlighting the role of pre-trained multi-task models in achieving scalable, generalizable solutions. Finally, we propose a standardized benchmark to evaluate models across diverse datasets and improve reproducibility, emphasizing how recent innovations are transforming neurological diagnostics toward intelligent, adaptable healthcare systems.

CVNov 27, 2025
ABounD: Adversarial Boundary-Driven Few-Shot Learning for Multi-Class Anomaly Detection

Runzhi Deng, Yundi Hu, Xinshuang Zhang et al.

Few-shot multi-class industrial anomaly detection remains a challenging task. Vision-language models need to be both category-adaptive and sharply discriminative, yet data scarcity often blurs the boundary between normal and abnormal states. This ambiguity leads to missed subtle defects and the rejection of atypical normal samples. We propose ABounD, an Adversarial Boundary-Driven few-shot learning for multi-class anomaly detection, which is a unified learning framework that integrates semantic concept learning with decision boundary shaping. The Dynamic Concept Fusion (DCF) module produces class-adaptive prompts by fusing generalizable priors with class-specific cues, conditioned on image features. Meanwhile, Adversarial Boundary Forging (ABF) sculpts a more precise decision margin by generating boundary-level fence features via PGD-style perturbations. Training is conducted in a single stage under a Concept-Boundary Loss, where ABF provides the main supervisory signal and semantic-spatial regularizers stabilize the optimization. This synergy yields a decision boundary that closely follows normal data while preserving flexibility and robust semantic alignment. Experiments on MVTec-AD and VisA datasets demonstrate state-of-the-art performance in the task of few-shot multi-class anomaly detection.

AIJan 20, 2022
Fine-Grained Trajectory-based Travel Time Estimation for Multi-city Scenarios Based on Deep Meta-Learning

Chenxing Wang, Fang Zhao, Haichao Zhang et al.

Travel Time Estimation (TTE) is indispensable in intelligent transportation system (ITS). It is significant to achieve the fine-grained Trajectory-based Travel Time Estimation (TTTE) for multi-city scenarios, namely to accurately estimate travel time of the given trajectory for multiple city scenarios. However, it faces great challenges due to complex factors including dynamic temporal dependencies and fine-grained spatial dependencies. To tackle these challenges, we propose a meta learning based framework, MetaTTE, to continuously provide accurate travel time estimation over time by leveraging well-designed deep neural network model called DED, which consists of Data preprocessing module and Encoder-Decoder network module. By introducing meta learning techniques, the generalization ability of MetaTTE is enhanced using small amount of examples, which opens up new opportunities to increase the potential of achieving consistent performance on TTTE when traffic conditions and road networks change over time in the future. The DED model adopts an encoder-decoder network to capture fine-grained spatial and temporal representations. Extensive experiments on two real-world datasets are conducted to confirm that our MetaTTE outperforms six state-of-art baselines, and improve 29.35% and 25.93% accuracy than the best baseline on Chengdu and Porto datasets, respectively.

CVJan 5, 2022
Biphasic Face Photo-Sketch Synthesis via Semantic-Driven Generative Adversarial Network with Graph Representation Learning

Xingqun Qi, Muyi Sun, Zijian Wang et al.

Biphasic face photo-sketch synthesis has significant practical value in wide-ranging fields such as digital entertainment and law enforcement. Previous approaches directly generate the photo-sketch in a global view, they always suffer from the low quality of sketches and complex photo variations, leading to unnatural and low-fidelity results. In this paper, we propose a novel Semantic-Driven Generative Adversarial Network to address the above issues, cooperating with Graph Representation Learning. Considering that human faces have distinct spatial structures, we first inject class-wise semantic layouts into the generator to provide style-based spatial information for synthesized face photos and sketches. Additionally, to enhance the authenticity of details in generated faces, we construct two types of representational graphs via semantic parsing maps upon input faces, dubbed the IntrA-class Semantic Graph (IASG) and the InteR-class Structure Graph (IRSG). Specifically, the IASG effectively models the intra-class semantic correlations of each facial semantic component, thus producing realistic facial details. To preserve the generated faces being more structure-coordinated, the IRSG models inter-class structural relations among every facial component by graph representation learning. To further enhance the perceptual quality of synthesized images, we present a biphasic interactive cycle training strategy by fully taking advantage of the multi-level feature consistency between the photo and sketch. Extensive experiments demonstrate that our method outperforms the state-of-the-art competitors on the CUFS and CUFSF datasets.

LGDec 6, 2021
Spatio-Temporal meets Wavelet: Disentangled Traffic Flow Forecasting via Efficient Spectral Graph Attention Network

Yuchen Fang, Yanjun Qin, Haiyong Luo et al.

Traffic forecasting is crucial for public safety and resource optimization, yet is very challenging due to three aspects: i) current existing works mostly exploit intricate temporal patterns (e.g., the short-term thunderstorm and long-term daily trends) within a single method, which fail to accurately capture spatio-temporal dependencies under different schemas; ii) the under-exploration of the graph positional encoding limit the extraction of spatial information in the commonly used full graph attention network; iii) the quadratic complexity of the full graph attention introduces heavy computational needs. To achieve the effective traffic flow forecasting, we propose an efficient spectral graph attention network with disentangled traffic sequences. Specifically, the discrete wavelet transform is leveraged to obtain the low- and high-frequency components of traffic sequences, and a dual-channel encoder is elaborately designed to accurately capture the spatio-temporal dependencies under long- and short-term schemas of the low- and high-frequency components. Moreover, a novel wavelet-based graph positional encoding and a query sampling strategy are introduced in our spectral graph attention to effectively guide message passing and efficiently calculate the attention. Extensive experiments on four real-world datasets show the superiority of our model, i.e., the higher traffic forecasting precision with lower computational cost.

LGDec 6, 2021
CDGNet: A Cross-Time Dynamic Graph-based Deep Learning Model for Traffic Forecasting

Yuchen Fang, Yanjun Qin, Haiyong Luo et al.

Traffic forecasting is important in intelligent transportation systems of webs and beneficial to traffic safety, yet is very challenging because of the complex and dynamic spatio-temporal dependencies in real-world traffic systems. Prior methods use the pre-defined or learnable static graph to extract spatial correlations. However, the static graph-based methods fail to mine the evolution of the traffic network. Researchers subsequently generate the dynamic graph for each time slice to reflect the changes of spatial correlations, but they follow the paradigm of independently modeling spatio-temporal dependencies, ignoring the cross-time spatial influence. In this paper, we propose a novel cross-time dynamic graph-based deep learning model, named CDGNet, for traffic forecasting. The model is able to effectively capture the cross-time spatial dependence between each time slice and its historical time slices by utilizing the cross-time dynamic graph. Meanwhile, we design a gating mechanism to sparse the cross-time dynamic graph, which conforms to the sparse spatial correlations in the real world. Besides, we propose a novel encoder-decoder architecture to incorporate the cross-time dynamic graph-based GCN for multi-step traffic forecasting. Experimental results on three real-world public traffic datasets demonstrate that CDGNet outperforms the state-of-the-art baselines. We additionally provide a qualitative study to analyze the effectiveness of our architecture.

LGDec 4, 2021
DMGCRN: Dynamic Multi-Graph Convolution Recurrent Network for Traffic Forecasting

Yanjun Qin, Yuchen Fang, Haiyong Luo et al.

Traffic forecasting is a problem of intelligent transportation systems (ITS) and crucial for individuals and public agencies. Therefore, researches pay great attention to deal with the complex spatio-temporal dependencies of traffic system for accurate forecasting. However, there are two challenges: 1) Most traffic forecasting studies mainly focus on modeling correlations of neighboring sensors and ignore correlations of remote sensors, e.g., business districts with similar spatio-temporal patterns; 2) Prior methods which use static adjacency matrix in graph convolutional networks (GCNs) are not enough to reflect the dynamic spatial dependence in traffic system. Moreover, fine-grained methods which use self-attention to model dynamic correlations of all sensors ignore hierarchical information in road networks and have quadratic computational complexity. In this paper, we propose a novel dynamic multi-graph convolution recurrent network (DMGCRN) to tackle above issues, which can model the spatial correlations of distance, the spatial correlations of structure, and the temporal correlations simultaneously. We not only use the distance-based graph to capture spatial information from nodes are close in distance but also construct a novel latent graph which encoded the structure correlations among roads to capture spatial information from nodes are similar in structure. Furthermore, we divide the neighbors of each sensor into coarse-grained regions, and dynamically assign different weights to each region at different times. Meanwhile, we integrate the dynamic multi-graph convolution network into the gated recurrent unit (GRU) to capture temporal dependence. Extensive experiments on three real-world traffic datasets demonstrate that our proposed algorithm outperforms state-of-the-art baselines.

LGDec 4, 2021
STJLA: A Multi-Context Aware Spatio-Temporal Joint Linear Attention Network for Traffic Forecasting

Yuchen Fang, Yanjun Qin, Haiyong Luo et al.

Traffic prediction has gradually attracted the attention of researchers because of the increase in traffic big data. Therefore, how to mine the complex spatio-temporal correlations in traffic data to predict traffic conditions more accurately become a difficult problem. Previous works combined graph convolution networks (GCNs) and self-attention mechanism with deep time series models (e.g. recurrent neural networks) to capture the spatio-temporal correlations separately, ignoring the relationships across time and space. Besides, GCNs are limited by over-smoothing issue and self-attention is limited by quadratic problem, result in GCNs lack global representation capabilities, and self-attention inefficiently capture the global spatial dependence. In this paper, we propose a novel deep learning model for traffic forecasting, named Multi-Context Aware Spatio-Temporal Joint Linear Attention (STJLA), which applies linear attention to the spatio-temporal joint graph to capture global dependence between all spatio-temporal nodes efficiently. More specifically, STJLA utilizes static structural context and dynamic semantic context to improve model performance. The static structure context based on node2vec and one-hot encoding enriches the spatio-temporal position information. Furthermore, the multi-head diffusion convolution network based dynamic spatial context enhances the local spatial perception ability, and the GRU based dynamic temporal context stabilizes sequence position information of the linear attention, respectively. Experiments on two real-world traffic datasets, England and PEMSD7, demonstrate that our STJLA can achieve up to 9.83% and 3.08% accuracy improvement in MAE measure over state-of-the-art baselines.

CVAug 19, 2021
Learning Anchored Unsigned Distance Functions with Gradient Direction Alignment for Single-view Garment Reconstruction

Fang Zhao, Wenhao Wang, Shengcai Liao et al.

While single-view 3D reconstruction has made significant progress benefiting from deep shape representations in recent years, garment reconstruction is still not solved well due to open surfaces, diverse topologies and complex geometric details. In this paper, we propose a novel learnable Anchored Unsigned Distance Function (AnchorUDF) representation for 3D garment reconstruction from a single image. AnchorUDF represents 3D shapes by predicting unsigned distance fields (UDFs) to enable open garment surface modeling at arbitrary resolution. To capture diverse garment topologies, AnchorUDF not only computes pixel-aligned local image features of query points, but also leverages a set of anchor points located around the surface to enrich 3D position features for query points, which provides stronger 3D space context for the distance function. Furthermore, in order to obtain more accurate point projection direction at inference, we explicitly align the spatial gradient direction of AnchorUDF with the ground-truth direction to the surface during training. Extensive experiments on two public 3D garment datasets, i.e., MGN and Deep Fashion3D, demonstrate that AnchorUDF achieves the state-of-the-art performance on single-view garment reconstruction.

LGJul 6, 2021
A Light-weight Deep Human Activity Recognition Algorithm Using Multi-knowledge Distillation

Runze Chen, Haiyong Luo, Fang Zhao et al.

Inertial sensor-based human activity recognition (HAR) is the base of many human-centered mobile applications. Deep learning-based fine-grained HAR models enable accurate classification in various complex application scenarios. Nevertheless, the large storage and computational overhead of the existing fine-grained deep HAR models hinder their widespread deployment on resource-limited platforms. Inspired by the knowledge distillation's reasonable model compression and potential performance improvement capability, we design a multi-level HAR modeling pipeline called Stage-Logits-Memory Distillation (SMLDist) based on the widely-used MobileNet. By paying more attention to the frequency-related features during the distillation process, the SMLDist improves the HAR classification robustness of the students. We also propose an auto-search mechanism in the heterogeneous classifiers to improve classification performance. Extensive simulation results demonstrate that SMLDist outperforms various state-of-the-art HAR frameworks in accuracy and F1 macro score. The practical evaluation of the Jetson Xavier AGX platform shows that the SMLDist model is both energy-efficient and computation-efficient. These experiments validate the reasonable balance between the robustness and efficiency of the proposed model. The comparative experiments of knowledge distillation on six public datasets also demonstrate that the SMLDist outperforms other advanced knowledge distillation methods of students' performance, which verifies the good generalization of the SMLDist on other classification tasks, including but not limited to HAR.

CVMay 9, 2021
Beyond Monocular Deraining: Parallel Stereo Deraining Network Via Semantic Prior

Kaihao Zhang, Wenhan Luo, Yanjiang Yu et al.

Rain is a common natural phenomenon. Taking images in the rain however often results in degraded quality of images, thus compromises the performance of many computer vision systems. Most existing de-rain algorithms use only one single input image and aim to recover a clean image. Few work has exploited stereo images. Moreover, even for single image based monocular deraining, many current methods fail to complete the task satisfactorily because they mostly rely on per pixel loss functions and ignore semantic information. In this paper, we present a Paired Rain Removal Network (PRRNet), which exploits both stereo images and semantic information. Specifically, we develop a Semantic-Aware Deraining Module (SADM) which solves both tasks of semantic segmentation and deraining of scenes, and a Semantic-Fusion Network (SFNet) and a View-Fusion Network (VFNet) which fuse semantic information and multi-view information respectively. In addition, we also introduce an Enhanced Paired Rain Removal Network (EPRRNet) which exploits semantic prior to remove rain streaks from stereo images. We first use a coarse deraining network to reduce the rain streaks on the input images, and then adopt a pre-trained semantic segmentation network to extract semantic features from the coarse derained image. Finally, a parallel stereo deraining network fuses semantic and multi-view information to restore finer results. We also propose new stereo based rainy datasets for benchmarking. Experiments on both monocular and the newly proposed stereo rainy datasets demonstrate that the proposed method achieves the state-of-the-art performance.

CVNov 24, 2020
DomainMix: Learning Generalizable Person Re-Identification Without Human Annotations

Wenhao Wang, Shengcai Liao, Fang Zhao et al.

Existing person re-identification models often have low generalizability, which is mostly due to limited availability of large-scale labeled data in training. However, labeling large-scale training data is very expensive and time-consuming, while large-scale synthetic dataset shows promising value in learning generalizable person re-identification models. Therefore, in this paper a novel and practical person re-identification task is proposed,i.e. how to use labeled synthetic dataset and unlabeled real-world dataset to train a universal model. In this way, human annotations are no longer required, and it is scalable to large and diverse real-world datasets. To address the task, we introduce a framework with high generalizability, namely DomainMix. Specifically, the proposed method firstly clusters the unlabeled real-world images and selects the reliable clusters. During training, to address the large domain gap between two domains, a domain-invariant feature learning method is proposed, which introduces a new loss,i.e. domain balance loss, to conduct an adversarial learning between domain-invariant feature learning and domain discrimination, and meanwhile learns a discriminative feature for person re-identification. This way, the domain gap between synthetic and real-world data is much reduced, and the learned feature is generalizable thanks to the large-scale and diverse training data. Experimental results show that the proposed annotation-free method is more or less comparable to the counterpart trained with full human annotations, which is quite promising. In addition, it achieves the current state of the art on several person re-identification datasets under direct cross-dataset evaluation.

CVFeb 13, 2019
Multi-Prototype Networks for Unconstrained Set-based Face Recognition

Jian Zhao, Jianshu Li, Xiaoguang Tu et al.

In this paper, we study the challenging unconstrained set-based face recognition problem where each subject face is instantiated by a set of media (images and videos) instead of a single image. Naively aggregating information from all the media within a set would suffer from the large intra-set variance caused by heterogeneous factors (e.g., varying media modalities, poses and illuminations) and fail to learn discriminative face representations. A novel Multi-Prototype Network (MPNet) model is thus proposed to learn multiple prototype face representations adaptively from the media sets. Each learned prototype is representative for the subject face under certain condition in terms of pose, illumination and media modality. Instead of handcrafting the set partition for prototype learning, MPNet introduces a Dense SubGraph (DSG) learning sub-net that implicitly untangles inconsistent media and learns a number of representative prototypes. Qualitative and quantitative experiments clearly demonstrate superiority of the proposed model over state-of-the-arts.

CVSep 2, 2018
Look Across Elapse: Disentangled Representation Learning and Photorealistic Cross-Age Face Synthesis for Age-Invariant Face Recognition

Jian Zhao, Yu Cheng, Yi Cheng et al.

Despite the remarkable progress in face recognition related technologies, reliably recognizing faces across ages still remains a big challenge. The appearance of a human face changes substantially over time, resulting in significant intra-class variations. As opposed to current techniques for age-invariant face recognition, which either directly extract age-invariant features for recognition, or first synthesize a face that matches target age before feature extraction, we argue that it is more desirable to perform both tasks jointly so that they can leverage each other. To this end, we propose a deep Age-Invariant Model (AIM) for face recognition in the wild with three distinct novelties. First, AIM presents a novel unified deep architecture jointly performing cross-age face synthesis and recognition in a mutual boosting way. Second, AIM achieves continuous face rejuvenation/aging with remarkable photorealistic and identity-preserving properties, avoiding the requirement of paired data and the true age of testing samples. Third, we develop effective and novel training strategies for end-to-end learning the whole deep architecture, which generates powerful age-invariant face representations explicitly disentangled from the age variation. Moreover, we propose a new large-scale Cross-Age Face Recognition (CAFR) benchmark dataset to facilitate existing efforts and push the frontiers of age-invariant face recognition research. Extensive experiments on both our CAFR and several other cross-age datasets (MORPH, CACD and FG-NET) demonstrate the superiority of the proposed AIM model over the state-of-the-arts. Benchmarking our model on one of the most popular unconstrained face recognition datasets IJB-C additionally verifies the promising generalizability of AIM in recognizing faces in the wild.

CVNov 16, 2017
Integrated Face Analytics Networks through Cross-Dataset Hybrid Training

Jianshu Li, Shengtao Xiao, Fang Zhao et al.

Face analytics benefits many multimedia applications. It consists of a number of tasks, such as facial emotion recognition and face parsing, and most existing approaches generally treat these tasks independently, which limits their deployment in real scenarios. In this paper we propose an integrated Face Analytics Network (iFAN), which is able to perform multiple tasks jointly for face analytics with a novel carefully designed network architecture to fully facilitate the informative interaction among different tasks. The proposed integrated network explicitly models the interactions between tasks so that the correlations between tasks can be fully exploited for performance boost. In addition, to solve the bottleneck of the absence of datasets with comprehensive training data for various tasks, we propose a novel cross-dataset hybrid training strategy. It allows "plug-in and play" of multiple datasets annotated for different tasks without the requirement of a fully labeled common dataset for all the tasks. We experimentally show that the proposed iFAN achieves state-of-the-art performance on multiple face analytics tasks using a single integrated model. Specifically, iFAN achieves an overall F-score of 91.15% on the Helen dataset for face parsing, a normalized mean error of 5.81% on the MTFL dataset for facial landmark localization and an accuracy of 45.73% on the BNU dataset for emotion recognition with a single model.

CVDec 27, 2016
Robust LSTM-Autoencoders for Face De-Occlusion in the Wild

Fang Zhao, Jiashi Feng, Jian Zhao et al.

Face recognition techniques have been developed significantly in recent years. However, recognizing faces with partial occlusion is still challenging for existing face recognizers which is heavily desired in real-world applications concerning surveillance and security. Although much research effort has been devoted to developing face de-occlusion methods, most of them can only work well under constrained conditions, such as all the faces are from a pre-defined closed set. In this paper, we propose a robust LSTM-Autoencoders (RLA) model to effectively restore partially occluded faces even in the wild. The RLA model consists of two LSTM components, which aims at occlusion-robust face encoding and recurrent occlusion removal respectively. The first one, named multi-scale spatial LSTM encoder, reads facial patches of various scales sequentially to output a latent representation, and occlusion-robustness is achieved owing to the fact that the influence of occlusion is only upon some of the patches. Receiving the representation learned by the encoder, the LSTM decoder with a dual channel architecture reconstructs the overall face and detects occlusion simultaneously, and by feat of LSTM, the decoder breaks down the task of face de-occlusion into restoring the occluded part step by step. Moreover, to minimize identify information loss and guarantee face recognition accuracy over recovered faces, we introduce an identity-preserving adversarial training scheme to further improve RLA. Extensive experiments on both synthetic and real datasets of faces with occlusion clearly demonstrate the effectiveness of our proposed RLA in removing different types of facial occlusion at various locations. The proposed method also provides significantly larger performance gain than other de-occlusion methods in promoting recognition performance over partially-occluded faces.

CVApr 29, 2016
Deep Edge Guided Recurrent Residual Learning for Image Super-Resolution

Wenhan Yang, Jiashi Feng, Jianchao Yang et al.

In this work, we consider the image super-resolution (SR) problem. The main challenge of image SR is to recover high-frequency details of a low-resolution (LR) image that are important for human perception. To address this essentially ill-posed problem, we introduce a Deep Edge Guided REcurrent rEsidual~(DEGREE) network to progressively recover the high-frequency details. Different from most of existing methods that aim at predicting high-resolution (HR) images directly, DEGREE investigates an alternative route to recover the difference between a pair of LR and HR images by recurrent residual learning. DEGREE further augments the SR process with edge-preserving capability, namely the LR image and its edge map can jointly infer the sharp edge details of the HR image during the recurrent recovery process. To speed up its training convergence rate, by-pass connections across multiple layers of DEGREE are constructed. In addition, we offer an understanding on DEGREE from the view-point of sub-band frequency decomposition on image signal and experimentally demonstrate how DEGREE can recover different frequency bands separately. Extensive experiments on three benchmark datasets clearly demonstrate the superiority of DEGREE over well-established baselines and DEGREE also provides new state-of-the-arts on these datasets.

CVJan 26, 2015
Deep Semantic Ranking Based Hashing for Multi-Label Image Retrieval

Fang Zhao, Yongzhen Huang, Liang Wang et al.

With the rapid growth of web images, hashing has received increasing interests in large scale image retrieval. Research efforts have been devoted to learning compact binary codes that preserve semantic similarity based on labels. However, most of these hashing methods are designed to handle simple binary similarity. The complex multilevel semantic structure of images associated with multiple labels have not yet been well explored. Here we propose a deep semantic ranking based method for learning hash functions that preserve multilevel semantic similarity between multi-label images. In our approach, deep convolutional neural network is incorporated into hash functions to jointly learn feature representations and mappings from them to hash codes, which avoids the limitation of semantic representation power of hand-crafted features. Meanwhile, a ranking list that encodes the multilevel similarity information is employed to guide the learning of such deep hash functions. An effective scheme based on surrogate loss is used to solve the intractable optimization problem of nonsmooth and multivariate ranking measures involved in the learning procedure. Experimental results show the superiority of our proposed approach over several state-of-the-art hashing methods in term of ranking evaluation metrics when tested on multi-label image datasets.

AIJul 11, 2012
A GPS Pseudorange Based Cooperative Vehicular Distance Measurement Technique

Daiqin Yang, Fang Zhao, Kai Liu et al.

Accurate vehicular localization is important for various cooperative vehicle safety (CVS) applications such as collision avoidance, turning assistant, etc. In this paper, we propose a cooperative vehicular distance measurement technique based on the sharing of GPS pseudorange measurements and a weighted least squares method. The classic double difference pseudorange solution, which was originally designed for high-end survey level GPS systems, is adapted to low-end navigation level GPS receivers for its wide availability in ground vehicles. The Carrier to Noise Ratio (CNR) of raw pseudorange measurements are taken into account for noise mitigation. We present a Dedicated Short Range Communications (DSRC) based mechanism to implement the exchange of pseudorange information among neighboring vehicles. As demonstrated in field tests, our proposed technique increases the accuracy of the distance measurement significantly compared with the distance obtained from the GPS fixes.