Zhizhong Zhang

CV
h-index40
49papers
1,973citations
Novelty55%
AI Score62

49 Papers

CVMay 9, 2024
Robust Pseudo-label Learning with Neighbor Relation for Unsupervised Visible-Infrared Person Re-Identification

Xiangbo Yin, Jiangming Shi, Yachao Zhang et al.

Unsupervised Visible-Infrared Person Re-identification (USVI-ReID) presents a formidable challenge, which aims to match pedestrian images across visible and infrared modalities without any annotations. Recently, clustered pseudo-label methods have become predominant in USVI-ReID, although the inherent noise in pseudo-labels presents a significant obstacle. Most existing works primarily focus on shielding the model from the harmful effects of noise, neglecting to calibrate noisy pseudo-labels usually associated with hard samples, which will compromise the robustness of the model. To address this issue, we design a Robust Pseudo-label Learning with Neighbor Relation (RPNR) framework for USVI-ReID. To be specific, we first introduce a straightforward yet potent Noisy Pseudo-label Calibration module to correct noisy pseudo-labels. Due to the high intra-class variations, noisy pseudo-labels are difficult to calibrate completely. Therefore, we introduce a Neighbor Relation Learning module to reduce high intra-class variations by modeling potential interactions between all samples. Subsequently, we devise an Optimal Transport Prototype Matching module to establish reliable cross-modality correspondences. On that basis, we design a Memory Hybrid Learning module to jointly learn modality-specific and modality-invariant information. Comprehensive experiments conducted on two widely recognized benchmarks, SYSU-MM01 and RegDB, demonstrate that RPNR outperforms the current state-of-the-art GUR with an average Rank-1 improvement of 10.3%. The source codes will be released soon.

CVMay 11, 2022
NTIRE 2022 Challenge on Efficient Super-Resolution: Methods and Results

Yawei Li, Kai Zhang, Radu Timofte et al. · eth-zurich, tencent-ai

This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improvement of efficiency measured according to several metrics including runtime, parameters, FLOPs, activations, and memory consumption while at least maintaining the PSNR of 29.00dB on DIV2K validation set. IMDN is set as the baseline for efficiency measurement. The challenge had 3 tracks including the main track (runtime), sub-track one (model complexity), and sub-track two (overall performance). In the main track, the practical runtime performance of the submissions was evaluated. The rank of the teams were determined directly by the absolute value of the average runtime on the validation set and test set. In sub-track one, the number of parameters and FLOPs were considered. And the individual rankings of the two metrics were summed up to determine a final ranking in this track. In sub-track two, all of the five metrics mentioned in the description of the challenge including runtime, parameter count, FLOPs, activations, and memory consumption were considered. Similar to sub-track one, the rankings of five metrics were summed up to determine a final ranking. The challenge had 303 registered participants, and 43 teams made valid submissions. They gauge the state-of-the-art in efficient single image super-resolution.

CVFeb 7, 2023Code
High-Resolution GAN Inversion for Degraded Images in Large Diverse Datasets

Yanbo Wang, Chuming Lin, Donghao Luo et al. · tencent-ai

The last decades are marked by massive and diverse image data, which shows increasingly high resolution and quality. However, some images we obtained may be corrupted, affecting the perception and the application of downstream tasks. A generic method for generating a high-quality image from the degraded one is in demand. In this paper, we present a novel GAN inversion framework that utilizes the powerful generative ability of StyleGAN-XL for this problem. To ease the inversion challenge with StyleGAN-XL, Clustering \& Regularize Inversion (CRI) is proposed. Specifically, the latent space is firstly divided into finer-grained sub-spaces by clustering. Instead of initializing the inversion with the average latent vector, we approximate a centroid latent vector from the clusters, which generates an image close to the input image. Then, an offset with a regularization term is introduced to keep the inverted latent vector within a certain range. We validate our CRI scheme on multiple restoration tasks (i.e., inpainting, colorization, and super-resolution) of complex natural images, and show preferable quantitative and qualitative results. We further demonstrate our technique is robust in terms of data and different GAN models. To our best knowledge, we are the first to adopt StyleGAN-XL for generating high-quality natural images from diverse degraded inputs. Code is available at https://github.com/Booooooooooo/CRI.

CVNov 22, 2023Code
Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

Zhen Zhao, Jingqun Tang, Chunhui Lin et al.

Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner, termed "In-Context Learning" (ICL). Nevertheless, applying LLMs as a text recognizer is unacceptably resource-consuming. Moreover, our pilot experiments on LLMs show that ICL fails in STR, mainly attributed to the insufficient incorporation of contextual information from diverse samples in the training stage. To this end, we introduce E$^2$STR, a STR model trained with context-rich scene text sequences, where the sequences are generated via our proposed in-context training strategy. E$^2$STR demonstrates that a regular-sized model is sufficient to achieve effective ICL capabilities in STR. Extensive experiments show that E$^2$STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks. The code is released at https://github.com/bytedance/E2STR .

CVAug 3, 2023Code
LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment

Zhiwei Zhang, Zhizhong Zhang, Qian Yu et al.

3D panoptic segmentation is a challenging perception task that requires both semantic segmentation and instance segmentation. In this task, we notice that images could provide rich texture, color, and discriminative information, which can complement LiDAR data for evident performance improvement, but their fusion remains a challenging problem. To this end, we propose LCPS, the first LiDAR-Camera Panoptic Segmentation network. In our approach, we conduct LiDAR-Camera fusion in three stages: 1) an Asynchronous Compensation Pixel Alignment (ACPA) module that calibrates the coordinate misalignment caused by asynchronous problems between sensors; 2) a Semantic-Aware Region Alignment (SARA) module that extends the one-to-one point-pixel mapping to one-to-many semantic relations; 3) a Point-to-Voxel feature Propagation (PVP) module that integrates both geometric and semantic fusion information for the entire point cloud. Our fusion strategy improves about 6.9% PQ performance over the LiDAR-only baseline on NuScenes dataset. Extensive quantitative and qualitative experiments further demonstrate the effectiveness of our novel framework. The code will be released at https://github.com/zhangzw12319/lcps.git.

CVJul 23, 2024Code
Harmonizing Visual Text Comprehension and Generation

Zhen Zhao, Jingqun Tang, Binghong Wu et al.

In this work, we present TextHarmony, a unified and versatile multimodal generative model proficient in comprehending and generating visual text. Simultaneously generating images and texts typically results in performance degradation due to the inherent inconsistency between vision and language modalities. To overcome this challenge, existing approaches resort to modality-specific data for supervised fine-tuning, necessitating distinct model instances. We propose Slide-LoRA, which dynamically aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generation space. Slide-LoRA harmonizes the generation of vision and language within a singular model instance, thereby facilitating a more unified generative process. Additionally, we develop a high-quality image caption dataset, DetailedTextCaps-100K, synthesized with a sophisticated closed-source MLLM to enhance visual text generation capabilities further. Comprehensive experiments across various benchmarks demonstrate the effectiveness of the proposed approach. Empowered by Slide-LoRA, TextHarmony achieves comparable performance to modality-specific fine-tuning results with only a 2% increase in parameters and shows an average improvement of 2.5% in visual text comprehension tasks and 4.0% in visual text generation tasks. Our work delineates the viability of an integrated approach to multimodal generation within the visual text domain, setting a foundation for subsequent inquiries. Code is available at https://github.com/bytedance/TextHarmony.

CVJun 20, 2022
Variational Distillation for Multi-View Learning

Xudong Tian, Zhizhong Zhang, Cong Wang et al.

Information Bottleneck (IB) based multi-view learning provides an information theoretic principle for seeking shared information contained in heterogeneous data descriptions. However, its great success is generally attributed to estimate the multivariate mutual information which is intractable when the network becomes complicated. Moreover, the representation learning tradeoff, {\it i.e.}, prediction-compression and sufficiency-consistency tradeoff, makes the IB hard to satisfy both requirements simultaneously. In this paper, we design several variational information bottlenecks to exploit two key characteristics ({\it i.e.}, sufficiency and consistency) for multi-view representation learning. Specifically, we propose a Multi-View Variational Distillation (MV$^2$D) strategy to provide a scalable, flexible and analytical solution to fitting MI by giving arbitrary input of viewpoints but without explicitly estimating it. Under rigorously theoretical guarantee, our approach enables IB to grasp the intrinsic correlation between observations and semantic labels, producing predictive and compact representations naturally. Also, our information-theoretic constraint can effectively neutralize the sensitivity to heterogeneous data by eliminating both task-irrelevant and view-specific information, preventing both tradeoffs in multiple view cases. To verify our theoretically grounded strategies, we apply our approaches to various benchmarks under three different applications. Extensive experiments to quantitatively and qualitatively demonstrate the effectiveness of our approach against state-of-the-art methods.

ITMay 28
User-Centric Clustering for uRLLC in Cell-Free RAN via Extreme Value Theory

Yu Zhang, Xinyue Yang, Dongming Wang et al.

Ultra-reliable low-latency communication (uRLLC) is a pivotal enabler for B5G/6G networks, yet it faces severe challenges from rare but critical extreme events, which are characterized by heavy tails in the delay distribution. While the cell-free radio access network (CF-RAN) architecture offers essential spatial diversity to combat these uncertainties, conventional user-centric clustering designs typically focus on average metrics, thereby inadequately addressing such tail behaviors. We propose a novel, tail-risk-aware, user-centric clustering framework operating within the finite blocklength (FBL) regime. Our approach employs extreme value theory (EVT), specifically the peaks-over-threshold (POT) model, to accurately quantify the probability of queue latency violations. This framework is applied to formulate an energy efficiency (EE) maximization problem under strict tail latency constraints. The problem is solved via an efficient online algorithm that integrates Lyapunov optimization with successive convex approximation (SCA). Simulation results demonstrate that the proposed scheme, through its dynamic adaptation of cluster formation to mitigate tail risks, achieves a superior reliability-efficiency trade-off and leads to a significant suppression of extreme latency events.

CVSep 16, 2022
Image Understands Point Cloud: Weakly Supervised 3D Semantic Segmentation via Association Learning

Tianfang Sun, Zhizhong Zhang, Xin Tan et al.

Weakly supervised point cloud semantic segmentation methods that require 1\% or fewer labels, hoping to realize almost the same performance as fully supervised approaches, which recently, have attracted extensive research attention. A typical solution in this framework is to use self-training or pseudo labeling to mine the supervision from the point cloud itself, but ignore the critical information from images. In fact, cameras widely exist in LiDAR scenarios and this complementary information seems to be greatly important for 3D applications. In this paper, we propose a novel cross-modality weakly supervised method for 3D segmentation, incorporating complementary information from unlabeled images. Basically, we design a dual-branch network equipped with an active labeling strategy, to maximize the power of tiny parts of labels and directly realize 2D-to-3D knowledge transfer. Afterwards, we establish a cross-modal self-training framework in an Expectation-Maximum (EM) perspective, which iterates between pseudo labels estimation and parameters updating. In the M-Step, we propose a cross-modal association learning to mine complementary supervision from images by reinforcing the cycle-consistency between 3D points and 2D superpixels. In the E-step, a pseudo label self-rectification mechanism is derived to filter noise labels thus providing more accurate labels for the networks to get fully trained. The extensive experimental results demonstrate that our method even outperforms the state-of-the-art fully supervised competitors with less than 1\% actively selected annotations.

CVAug 30, 2022
Boosting Night-time Scene Parsing with Learnable Frequency

Zhifeng Xie, Sen Wang, Ke Xu et al.

Night-Time Scene Parsing (NTSP) is essential to many vision applications, especially for autonomous driving. Most of the existing methods are proposed for day-time scene parsing. They rely on modeling pixel intensity-based spatial contextual cues under even illumination. Hence, these methods do not perform well in night-time scenes as such spatial contextual cues are buried in the over-/under-exposed regions in night-time scenes. In this paper, we first conduct an image frequency-based statistical experiment to interpret the day-time and night-time scene discrepancies. We find that image frequency distributions differ significantly between day-time and night-time scenes, and understanding such frequency distributions is critical to NTSP problem. Based on this, we propose to exploit the image frequency distributions for night-time scene parsing. First, we propose a Learnable Frequency Encoder (LFE) to model the relationship between different frequency coefficients to measure all frequency components dynamically. Second, we propose a Spatial Frequency Fusion module (SFF) that fuses both spatial and frequency information to guide the extraction of spatial context features. Extensive experiments show that our method performs favorably against the state-of-the-art methods on the NightCity, NightCity+ and BDD100K-night datasets. In addition, we demonstrate that our method can be applied to existing day-time scene parsing methods and boost their performance on night-time scenes.

ROMay 28
Structured interactions improve distributed coordination beyond model scaling in a real-world multi-robot system

Junping Wang, Zhizhong Zhang, Yongqiang Tang et al.

Scaling individual robot capabilities is common but costly. Here we investigate a system-level design question in real-world multi-robot coordination: given matched hardware budgets, does restructuring communication among robots yield larger gains than increasing onboard model size? Using a representative transport-and-mapping task with 10 physical robots (5 runs per condition, 60 runs total), we find that switching from fully connected to modular hierarchical interactions improves normalised performance by 47 points (0--100), whereas doubling neural network hidden size yields at most 9 points. Nested mixed-effects model comparisons show a substantially larger improvement in model fit for topology than for scale. The pattern is confirmed in independent SMAC replications; heterogeneous benchmark reanalyses provide secondary supporting consistency checks rather than primary evidence. Performance saturation beyond 1024 hidden units is observed in simulation-calibrated extrapolation, not directly on hardware. These results indicate that interaction structure can play a dominant role within the tested system and task setting, while broader quantitative generalisation remains to be established.

CVJul 17, 2024
Mutual Information Guided Optimal Transport for Unsupervised Visible-Infrared Person Re-identification

Zhizhong Zhang, Jiangming Wang, Xin Tan et al.

Unsupervised visible infrared person re-identification (USVI-ReID) is a challenging retrieval task that aims to retrieve cross-modality pedestrian images without using any label information. In this task, the large cross-modality variance makes it difficult to generate reliable cross-modality labels, and the lack of annotations also provides additional difficulties for learning modality-invariant features. In this paper, we first deduce an optimization objective for unsupervised VI-ReID based on the mutual information between the model's cross-modality input and output. With equivalent derivation, three learning principles, i.e., "Sharpness" (entropy minimization), "Fairness" (uniform label distribution), and "Fitness" (reliable cross-modality matching) are obtained. Under their guidance, we design a loop iterative training strategy alternating between model training and cross-modality matching. In the matching stage, a uniform prior guided optimal transport assignment ("Fitness", "Fairness") is proposed to select matched visible and infrared prototypes. In the training stage, we utilize this matching information to introduce prototype-based contrastive learning for minimizing the intra- and cross-modality entropy ("Sharpness"). Extensive experimental results on benchmarks demonstrate the effectiveness of our method, e.g., 60.6% and 90.3% of Rank-1 accuracy on SYSU-MM01 and RegDB without any annotations.

CVAug 31, 2022
Attentive pooling for Group Activity Recognition

Ding Li, Yuan Xie, Wensheng Zhang et al.

In group activity recognition, hierarchical framework is widely adopted to represent the relationships between individuals and their corresponding group, and has achieved promising performance. However, the existing methods simply employed max/average pooling in this framework, which ignored the distinct contributions of different individuals to the group activity recognition. In this paper, we propose a new contextual pooling scheme, named attentive pooling, which enables the weighted information transition from individual actions to group activity. By utilizing the attention mechanism, the attentive pooling is intrinsically interpretable and able to embed member context into the existing hierarchical model. In order to verify the effectiveness of the proposed scheme, two specific attentive pooling methods, i.e., global attentive pooling (GAP) and hierarchical attentive pooling (HAP) are designed. GAP rewards the individuals that are significant to group activity, while HAP further considers the hierarchical division by introducing subgroup structure. The experimental results on the benchmark dataset demonstrate that our proposal is significantly superior beyond the baseline and is comparable to the state-of-the-art methods.

LGApr 15
DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

Xiaofan Li, Ming Yang, Zhiyuan Ma et al.

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration and exploitation trade-off remains a critical challenge. In this paper, we fully analyze the exploration and exploitation dilemma of extremely hard and easy samples during the training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby mining fine-grained samples requiring exploration-exploitation trade-off. Subsequently, we propose a bidirectional reward allocation mechanism with a minimum impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we have evaluated our method on two mainstream tasks: mathematical reasoning and function calling, and experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance by fine-grained exploration-exploitation trade-off.

CVJul 10, 2024
Exploring the Untouched Sweeps for Conflict-Aware 3D Segmentation Pretraining

Tianfang Sun, Zhizhong Zhang, Xin Tan et al.

LiDAR-camera 3D representation pretraining has shown significant promise for 3D perception tasks and related applications. However, two issues widely exist in this framework: 1) Solely keyframes are used for training. For example, in nuScenes, a substantial quantity of unpaired LiDAR and camera frames remain unutilized, limiting the representation capabilities of the pretrained network. 2) The contrastive loss erroneously distances points and image regions with identical semantics but from different frames, disturbing the semantic consistency of the learned presentations. In this paper, we propose a novel Vision-Foundation-Model-driven sample exploring module to meticulously select LiDAR-Image pairs from unexplored frames, enriching the original training set. We utilized timestamps and the semantic priors from VFMs to identify well-synchronized training pairs and to discover samples with diverse content. Moreover, we design a cross- and intra-modal conflict-aware contrastive loss using the semantic mask labels of VFMs to avoid contrasting semantically similar points and image regions. Our method consistently outperforms existing state-of-the-art pretraining frameworks across three major public autonomous driving datasets: nuScenes, SemanticKITTI, and Waymo on 3D semantic segmentation by +3.0\%, +3.0\%, and +3.3\% in mIoU, respectively. Furthermore, our approach exhibits adaptable generalization to different 3D backbones and typical semantic masks generated by non-VFM models.

CVMar 15
S2GS: Streaming Semantic Gaussian Splatting for Online Scene Understanding and Reconstruction

Renhe Zhang, Yuyang Tan, Jingyu Gong et al.

Existing offline feed-forward methods for joint scene understanding and reconstruction on long image streams often repeatedly perform global computation over an ever-growing set of past observations, causing runtime and GPU memory to increase rapidly with sequence length and limiting scalability. We propose Streaming Semantic Gaussian Splatting (S2GS), a strictly causal, incremental 3D Gaussian semantic field framework: it does not leverage future frames and continuously updates scene geometry, appearance, and instance-level semantics without reprocessing historical frames, enabling scalable online joint reconstruction and understanding. S2GS adopts a geometry-semantic decoupled dual-backbone design: the geometry branch performs causal modeling to drive incremental Gaussian updates, while the semantic branch leverages a 2D foundation vision model and a query-driven decoder to predict segmentation masks and identity embeddings, further stabilized by query-level contrastive alignment and lightweight online association with an instance memory. Experiments show that S2GS matches or outperforms strong offline baselines on joint reconstruction-and-understanding benchmarks, while significantly improving long-horizon scalability: it processes 1,000+ frames with much slower growth in runtime and GPU memory, whereas offline global-processing baselines typically run out of memory at around 80 frames under the same setting.

CVFeb 27, 2025Code
One-for-More: Continual Diffusion Model for Anomaly Detection

Xiaofan Li, Xin Tan, Zhuo Chen et al.

With the rise of generative models, there is a growing interest in unifying all tasks within a generative framework. Anomaly detection methods also fall into this scope and utilize diffusion models to generate or reconstruct normal samples when given arbitrary anomaly images. However, our study found that the diffusion model suffers from severe ``faithfulness hallucination'' and ``catastrophic forgetting'', which can't meet the unpredictable pattern increments. To mitigate the above problems, we propose a continual diffusion model that uses gradient projection to achieve stable continual learning. Gradient projection deploys a regularization on the model updating by modifying the gradient towards the direction protecting the learned knowledge. But as a double-edged sword, it also requires huge memory costs brought by the Markov process. Hence, we propose an iterative singular value decomposition method based on the transitive property of linear representation, which consumes tiny memory and incurs almost no performance loss. Finally, considering the risk of ``over-fitting'' to normal images of the diffusion model, we propose an anomaly-masked network to enhance the condition mechanism of the diffusion model. For continual anomaly detection, ours achieves first place in 17/18 settings on MVTec and VisA. Code is available at https://github.com/FuNz-0/One-for-More

CVMay 12, 2024Code
Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception

Haoming Chen, Zhizhong Zhang, Yanyun Qu et al.

An effective pre-training framework with universal 3D representations is extremely desired in perceiving large-scale dynamic scenes. However, establishing such an ideal framework that is both task-generic and label-efficient poses a challenge in unifying the representation of the same primitive across diverse scenes. The current contrastive 3D pre-training methods typically follow a frame-level consistency, which focuses on the 2D-3D relationships in each detached image. Such inconsiderate consistency greatly hampers the promising path of reaching an universal pre-training framework: (1) The cross-scene semantic self-conflict, i.e., the intense collision between primitive segments of the same semantics from different scenes; (2) Lacking a globally unified bond that pushes the cross-scene semantic consistency into 3D representation learning. To address above challenges, we propose a CSC framework that puts a scene-level semantic consistency in the heart, bridging the connection of the similar semantic segments across various scenes. To achieve this goal, we combine the coherent semantic cues provided by the vision foundation model and the knowledge-rich cross-scene prototypes derived from the complementary multi-modality information. These allow us to train a universal 3D pre-training model that facilitates various downstream tasks with less fine-tuning efforts. Empirically, we achieve consistent improvements over SOTA pre-training approaches in semantic segmentation (+1.4% mIoU), object detection (+1.0% mAP), and panoptic segmentation (+3.0% PQ) using their task-specific 3D network on nuScenes. Code is released at https://github.com/chenhaomingbob/CSC, hoping to inspire future research.

CVDec 3, 2024Code
Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis

Jingyu Gong, Chong Zhang, Fengqi Liu et al.

Scene-aware motion synthesis has been widely researched recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data, while it is difficult to generalize to diverse scenes when trained only on a few specific ones. Thus, we propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis, where paired motion-scene data are no longer necessary. In this paper, we disentangle human-scene interaction from motion synthesis during training, and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion can be derived through iterative diffusion denoising and implicit policy optimization, thus motion naturalness and interaction plausibility can be maintained simultaneously. For long-term motion synthesis, we introduce motion blending in joint rotation power space. The proposed method is evaluated on synthesized scenes with ShapeNet furniture, and real scenes from PROX and Replica. Results show that our framework presents better motion naturalness and interaction plausibility than cutting-edge methods. This also indicates the feasibility of utilizing the DIP for motion synthesis in more general tasks and versatile scenes. Code will be publicly available at https://github.com/jingyugong/DIP.

CVMay 25, 2021Code
Towards Compact Single Image Super-Resolution via Contrastive Self-distillation

Yanbo Wang, Shaohui Lin, Yanyun Qu et al.

Convolutional neural networks (CNNs) are highly successful for super-resolution (SR) but often require sophisticated architectures with heavy memory cost and computational overhead, significantly restricts their practical deployments on resource-limited devices. In this paper, we proposed a novel contrastive self-distillation (CSD) framework to simultaneously compress and accelerate various off-the-shelf SR models. In particular, a channel-splitting super-resolution network can first be constructed from a target teacher network as a compact student network. Then, we propose a novel contrastive loss to improve the quality of SR images and PSNR/SSIM via explicit knowledge transfer. Extensive experiments demonstrate that the proposed CSD scheme effectively compresses and accelerates several standard SR models such as EDSR, RCAN and CARN. Code is available at https://github.com/Booooooooooo/CSD.

CVApr 19, 2021Code
Contrastive Learning for Compact Single Image Dehazing

Haiyan Wu, Yanyun Qu, Shaohui Lin et al.

Single image dehazing is a challenging ill-posed problem due to the severe information degeneration. However, existing deep learning based dehazing methods only adopt clear images as positive samples to guide the training of dehazing network while negative information is unexploited. Moreover, most of them focus on strengthening the dehazing network with an increase of depth and width, leading to a significant requirement of computation and memory. In this paper, we propose a novel contrastive regularization (CR) built upon contrastive learning to exploit both the information of hazy images and clear images as negative and positive samples, respectively. CR ensures that the restored image is pulled to closer to the clear image and pushed to far away from the hazy image in the representation space. Furthermore, considering trade-off between performance and memory storage, we develop a compact dehazing network based on autoencoder-like (AE) framework. It involves an adaptive mixup operation and a dynamic feature enhancement module, which can benefit from preserving information flow adaptively and expanding the receptive field to improve the network's transformation capability, respectively. We term our dehazing network with autoencoder and contrastive regularization as AECR-Net. The extensive experiments on synthetic and real-world datasets demonstrate that our AECR-Net surpass the state-of-the-art approaches. The code is released in https://github.com/GlassyWu/AECR-Net.

CVApr 8, 2024
PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection

Xiaofan Li, Zhizhong Zhang, Xin Tan et al.

The vision-language model has brought great improvement to few-shot industrial anomaly detection, which usually needs to design of hundreds of prompts through prompt engineering. For automated scenarios, we first use conventional prompt learning with many-class paradigm as the baseline to automatically learn prompts but found that it can not work well in one-class anomaly detection. To address the above problem, this paper proposes a one-class prompt learning method for few-shot anomaly detection, termed PromptAD. First, we propose semantic concatenation which can transpose normal prompts into anomaly prompts by concatenating normal prompts with anomaly suffixes, thus constructing a large number of negative samples used to guide prompt learning in one-class setting. Furthermore, to mitigate the training challenge caused by the absence of anomaly images, we introduce the concept of explicit anomaly margin, which is used to explicitly control the margin between normal prompt features and anomaly prompt features through a hyper-parameter. For image-level/pixel-level anomaly detection, PromptAD achieves first place in 11/12 few-shot settings on MVTec and VisA.

CVDec 4, 2023
COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

Qihang Ma, Xin Tan, Yanyun Qu et al.

The autonomous driving community has shown significant interest in 3D occupancy prediction, driven by its exceptional geometric perception and general object recognition capabilities. To achieve this, current works try to construct a Tri-Perspective View (TPV) or Occupancy (OCC) representation extending from the Bird-Eye-View perception. However, compressed views like TPV representation lose 3D geometry information while raw and sparse OCC representation requires heavy but redundant computational costs. To address the above limitations, we propose Compact Occupancy TRansformer (COTR), with a geometry-aware occupancy encoder and a semantic-aware group decoder to reconstruct a compact 3D OCC representation. The occupancy encoder first generates a compact geometrical OCC feature through efficient explicit-implicit view transformation. Then, the occupancy decoder further enhances the semantic discriminability of the compact OCC representation by a coarse-to-fine semantic grouping strategy. Empirical experiments show that there are evident performance gains across multiple baselines, e.g., COTR outperforms baselines with a relative improvement of 8%-15%, demonstrating the superiority of our method.

AIFeb 3
TAME: A Trustworthy Test-Time Evolution of Agent Memory with Systematic Benchmarking

Yu Cheng, Jiuan Zhou, Yongkang Hu et al.

Test-time evolution of agent memory serves as a pivotal paradigm for achieving AGI by bolstering complex reasoning through experience accumulation. However, even during benign task evolution, agent safety alignment remains vulnerable-a phenomenon known as Agent Memory Misevolution. To evaluate this phenomenon, we construct the Trust-Memevo benchmark to assess multi-dimensional trustworthiness during benign task evolution, revealing an overall decline in trustworthiness across various task domains and evaluation settings. To address this issue, we propose TAME, a dual-memory evolutionary framework that separately evolves executor memory to improve task performance by distilling generalizable methodologies, and evaluator memory to refine assessments of both safety and task utility based on historical feedback. Through a closed loop of memory filtering, draft generation, trustworthy refinement, execution, and dual-track memory updating, TAME preserves trustworthiness without sacrificing utility. Experiments demonstrate that TAME mitigates misevolution, achieving a joint improvement in both trustworthiness and task performance.

CVApr 22
GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds

Ao Gao, Jingyu Gong, Xin Tan et al.

While 3D Gaussian Splatting (3DGS) has revolutionized real-time rendering, its performance degrades significantly under sparse-view extrapolation, manifesting as severe geometric voids and artifacts. Existing solutions primarily rely on an iterative "Repair-then-Distill" paradigm, which is inherently unstable and prone to overfitting. In this work, we propose GSCompleter, a distillation-free plugin that shifts scene completion to a stable "Generate-then-Register" workflow. Our approach first synthesizes plausible 2D reference images and explicitly lifts them into metric-scale 3D primitives via a robust Stereo-Anchor mechanism. These primitives are then seamlessly integrated into the global context through a novel Ray-Constrained Registration strategy. This shift to a rapid registration paradigm delivers superior 3DGS completion performance across three distinct benchmarks, enhancing the quality and efficiency of various baselines and achieving new SOTA results.

CVJan 12, 2024
Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Jiangming Shi, Xiangbo Yin, Yeyun Chen et al.

Unsupervised visible-infrared person re-identification (USL-VI-ReID) is a promising yet challenging retrieval task. The key challenges in USL-VI-ReID are to effectively generate pseudo-labels and establish pseudo-label correspondences across modalities without relying on any prior annotations. Recently, clustered pseudo-label methods have gained more attention in USL-VI-ReID. However, previous methods fell short of fully exploiting the individual nuances, as they simply utilized a single memory that represented an identity to establish cross-modality correspondences, resulting in ambiguous cross-modality correspondences. To address the problem, we propose a Multi-Memory Matching (MMM) framework for USL-VI-ReID. We first design a Cross-Modality Clustering (CMC) module to generate the pseudo-labels through clustering together both two modality samples. To associate cross-modality clustered pseudo-labels, we design a Multi-Memory Learning and Matching (MMLM) module, ensuring that optimization explicitly focuses on the nuances of individual perspectives and establishes reliable cross-modality correspondences. Finally, we design a Soft Cluster-level Alignment (SCA) module to narrow the modality gap while mitigating the effect of noise pseudo-labels through a soft many-to-many alignment strategy. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the reliability of the established cross-modality correspondences and the effectiveness of our MMM. The source codes will be released.

CVMay 17, 2024
GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision

Xin Tan, Wenbin Wu, Zhiwei Zhang et al.

3D occupancy perception holds a pivotal role in recent vision-centric autonomous driving systems by converting surround-view images into integrated geometric and semantic representations within dense 3D grids. Nevertheless, current models still encounter two main challenges: modeling depth accurately in the 2D-3D view transformation stage, and overcoming the lack of generalizability issues due to sparse LiDAR supervision. To address these issues, this paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach is three-fold: 1) Integration of explicit lift-based depth prediction and implicit projection-based transformers for depth modeling, enhancing the density and robustness of view transformation. 2) Utilization of mask-based encoder-decoder architecture for fine-grained semantic predictions; 3) Adoption of context-aware self-training loss functions in the pertaining stage to complement LiDAR supervision, involving the re-rendering of 2D depth maps from 3D occupancy features and leveraging image reconstruction loss to obtain denser depth supervision besides sparse LiDAR ground-truths. Our approach achieves State-Of-The-Art performance on the Occ3D-nuScenes dataset with the least image resolution needed and the most weightless image backbone compared with current models, marking an improvement of 3.3% due to our proposed contributions. Comprehensive experimentation also demonstrates the consistent superiority of our method over baselines and alternative approaches.

CVMar 17, 2025
Hydra-MDP++: Advancing End-to-End Driving via Expert-Guided Hydra-Distillation

Kailin Li, Zhenxin Li, Shiyi Lan et al.

Hydra-MDP++ introduces a novel teacher-student knowledge distillation framework with a multi-head decoder that learns from human demonstrations and rule-based experts. Using a lightweight ResNet-34 network without complex components, the framework incorporates expanded evaluation metrics, including traffic light compliance (TL), lane-keeping ability (LK), and extended comfort (EC) to address unsafe behaviors not captured by traditional NAVSIM-derived teachers. Like other end-to-end autonomous driving approaches, \hydra processes raw images directly without relying on privileged perception signals. Hydra-MDP++ achieves state-of-the-art performance by integrating these components with a 91.0% drive score on NAVSIM through scaling to a V2-99 image encoder, demonstrating its effectiveness in handling diverse driving scenarios while maintaining computational efficiency.

CVFeb 29, 2024
Learning Commonality, Divergence and Variety for Unsupervised Visible-Infrared Person Re-identification

Jiangming Shi, Xiangbo Yin, Yachao Zhang et al.

Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match specified people in infrared images to visible images without annotations, and vice versa. USVI-ReID is a challenging yet under-explored task. Most existing methods address the USVI-ReID using cluster-based contrastive learning, which simply employs the cluster center as a representation of a person. However, the cluster center primarily focuses on commonality, overlooking divergence and variety. To address the problem, we propose a Progressive Contrastive Learning with Hard and Dynamic Prototypes method for USVI-ReID. In brief, we generate the hard prototype by selecting the sample with the maximum distance from the cluster center. We theoretically show that the hard prototype is used in the contrastive loss to emphasize divergence. Additionally, instead of rigidly aligning query images to a specific prototype, we generate the dynamic prototype by randomly picking samples within a cluster. The dynamic prototype is used to encourage the variety. Finally, we introduce a progressive learning strategy to gradually shift the model's attention towards divergence and variety, avoiding cluster deterioration. Extensive experiments conducted on the publicly available SYSU-MM01 and RegDB datasets validate the effectiveness of the proposed method.

CVDec 13, 2023
Beyond the Label Itself: Latent Labels Enhance Semi-supervised Point Cloud Panoptic Segmentation

Yujun Chen, Xin Tan, Zhizhong Zhang et al.

As the exorbitant expense of labeling autopilot datasets and the growing trend of utilizing unlabeled data, semi-supervised segmentation on point clouds becomes increasingly imperative. Intuitively, finding out more ``unspoken words'' (i.e., latent instance information) beyond the label itself should be helpful to improve performance. In this paper, we discover two types of latent labels behind the displayed label embedded in LiDAR and image data. First, in the LiDAR Branch, we propose a novel augmentation, Cylinder-Mix, which is able to augment more yet reliable samples for training. Second, in the Image Branch, we propose the Instance Position-scale Learning (IPSL) Module to learn and fuse the information of instance position and scale, which is from a 2D pre-trained detector and a type of latent label obtained from 3D to 2D projection. Finally, the two latent labels are embedded into the multi-modal panoptic segmentation network. The ablation of the IPSL module demonstrates its robust adaptability, and the experiments evaluated on SemanticKITTI and nuScenes demonstrate that our model outperforms the state-of-the-art method, LaserMix.

CVApr 22
ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

Wentao Yan, Shengqin Wang, Huichi Zhou et al.

Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision, initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally-trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.

CVApr 21
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

Shengqin Wang, Wentao Yan, Huichi Zhou et al.

Agentic multimodal models have garnered significant attention for their ability to leverage external tools to tackle complex tasks. However, it is observed that such agents often meet premature interaction collapse, caused by two primary reasons: 1) the terminal reward often appending on the last token prevents the advantage from distinguishing trajectories with exploratory behavior; 2) excessively redundant context hinders the agent from absorbing useful feedback. To address these issues, we propose the Deepening Reasoning MMSearchAgent, the framework leverages the structural proximity to derive advantage signals from the whole rollout trajectories in an entire batch, such that trajectories of different lengths are further encouraged to be generated, even when containing the same correct answer. Additionally, differentiated gaussian rewards are employed to dynamically calibrate interaction tolerance, thereby ensuring information reliability and reduce redundancy. To support multi-turn interaction training, we have constructed a multi-step deep-reasoning dataset including 3602 high-quality QA pair with at least 3 reasonning steps. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming the MMSearch-R1 by 8.4$\%$ on FVQA-test.

AIApr 6
Memory Intelligence Agent

Jingyang Qiao, Weicheng Meng, Yu Cheng et al.

Deep research agents (DRAs) integrate LLM reasoning with external tools. Memory systems enable DRAs to leverage historical experiences, which are essential for efficient reasoning and autonomous evolution. Existing methods rely on retrieving similar trajectories from memory to aid reasoning, while suffering from key limitations of ineffective memory evolution and increasing storage and retrieval costs. To address these problems, we propose a novel Memory Intelligence Agent (MIA) framework, consisting of a Manager-Planner-Executor architecture. Memory Manager is a non-parametric memory system that can store compressed historical search trajectories. Planner is a parametric memory agent that can produce search plans for questions. Executor is another agent that can search and analyze information guided by the search plan. To build the MIA framework, we first adopt an alternating reinforcement learning paradigm to enhance cooperation between the Planner and the Executor. Furthermore, we enable the Planner to continuously evolve during test-time learning, with updates performed on-the-fly alongside inference without interrupting the reasoning process. Additionally, we establish a bidirectional conversion loop between parametric and non-parametric memories to achieve efficient memory evolution. Finally, we incorporate a reflection and an unsupervised judgment mechanisms to boost reasoning and self-evolution in the open world. Extensive experiments across eleven benchmarks demonstrate the superiority of MIA.

ROAug 4, 2025
NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks

Zhihao Luo, Wentao Yan, Jingyu Gong et al.

Recent advances in Graphical User Interface (GUI) and embodied navigation have driven progress, yet these domains have largely evolved in isolation, with disparate datasets and training paradigms. In this paper, we observe that both tasks can be formulated as Markov Decision Processes (MDP), suggesting a foundational principle for their unification. Hence, we present NaviMaster, the first unified agent capable of unifying GUI navigation and embodied navigation within a single framework. Specifically, NaviMaster (i) proposes a visual-target trajectory collection pipeline that generates trajectories for both GUI and embodied tasks using a single formulation. (ii) employs a unified reinforcement learning framework on the mix data to improve generalization. (iii) designs a novel distance-aware reward to ensure efficient learning from the trajectories. Through extensive experiments on out-of-domain benchmarks, NaviMaster is shown to outperform state-of-the-art agents in GUI navigation, spatial affordance prediction, and embodied navigation. Ablation studies further demonstrate the efficacy of our unified training strategy, data mixing strategy, and reward design.

CVJun 23, 2025
YouTube-Occ: Learning Indoor 3D Semantic Occupancy Prediction from YouTube Videos

Haoming Chen, Lichen Yuan, TianFang Sun et al.

3D semantic occupancy prediction in the past was considered to require precise geometric relationships in order to enable effective training. However, in complex indoor environments, the large-scale and widespread collection of data, along with the necessity for fine-grained annotations, becomes impractical due to the complexity of data acquisition setups and privacy concerns. In this paper, we demonstrate that 3D spatially-accurate training can be achieved using only indoor Internet data, without the need for any pre-knowledge of intrinsic or extrinsic camera parameters. In our framework, we collect a web dataset, YouTube-Occ, which comprises house tour videos from YouTube, providing abundant real house scenes for 3D representation learning. Upon on this web dataset, we establish a fully self-supervised model to leverage accessible 2D prior knowledge for reaching powerful 3D indoor perception. Specifically, we harness the advantages of the prosperous vision foundation models, distilling the 2D region-level knowledge into the occupancy network by grouping the similar pixels into superpixels. Experimental results show that our method achieves state-of-the-art zero-shot performance on two popular benchmarks (NYUv2 and OccScanNet

LGFeb 20
Continual-NExT: A Unified Comprehension And Generation Continual Learning Framework

Jingyang Qiao, Zhizhong Zhang, Xin Tan et al.

Dual-to-Dual MLLMs refer to Multimodal Large Language Models, which can enable unified multimodal comprehension and generation through text and image modalities. Although exhibiting strong instantaneous learning and generalization capabilities, Dual-to-Dual MLLMs still remain deficient in lifelong evolution, significantly affecting continual adaptation to dynamic real-world scenarios. One of the challenges is that learning new tasks inevitably destroys the learned knowledge. Beyond traditional catastrophic forgetting, Dual-to-Dual MLLMs face other challenges, including hallucination, instruction unfollowing, and failures in cross-modal knowledge transfer. However, no standardized continual learning framework for Dual-to-Dual MLLMs has been established yet, leaving these challenges unexplored. Thus, in this paper, we establish Continual-NExT, a continual learning framework for Dual-to-Dual MLLMs with deliberately-architected evaluation metrics. To improve the continual learning capability of Dual-to-Dual MLLMs, we propose an efficient MAGE (Mixture and Aggregation of General LoRA and Expert LoRA) method to further facilitate knowledge transfer across modalities and mitigate forgetting. Extensive experiments demonstrate that MAGE outperforms other continual learning methods and achieves state-of-the-art performance.

CVJan 4
EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding

Tianjun Gu, Chenghua Gong, Jingyu Gong et al.

The ability to reason about spatial dynamics is a cornerstone of intelligence, yet current research overlooks the human intent behind spatial changes. To address these limitations, we introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning--understanding the physical principles of object interactions--and Intent-Driven Reasoning--inferring the human goals behind these actions. To catalyze research in TSI, we present EscherVerse, consisting of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and models (Escher series). Derived from real-world videos, EscherVerse moves beyond constrained settings to explicitly evaluate an agent's ability to reason about object permanence, state transitions, and trajectory prediction in dynamic, human-centric scenarios. Crucially, it is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes. Our work, including a novel data curation pipeline, provides a foundational resource to advance spatial intelligence from passive scene description toward a holistic, purpose-driven understanding of the world.

CVOct 27, 2025
Switchable Token-Specific Codebook Quantization For Face Image Compression

Yongbo Wang, Haonan Wang, Guodong Mu et al.

With the ever-increasing volume of visual data, the efficient and lossless transmission, along with its subsequent interpretation and understanding, has become a critical bottleneck in modern information systems. The emerged codebook-based solution utilize a globally shared codebook to quantize and dequantize each token, controlling the bpp by adjusting the number of tokens or the codebook size. However, for facial images, which are rich in attributes, such global codebook strategies overlook both the category-specific correlations within images and the semantic differences among tokens, resulting in suboptimal performance, especially at low bpp. Motivated by these observations, we propose a Switchable Token-Specific Codebook Quantization for face image compression, which learns distinct codebook groups for different image categories and assigns an independent codebook to each token. By recording the codebook group to which each token belongs with a small number of bits, our method can reduce the loss incurred when decreasing the size of each codebook group. This enables a larger total number of codebooks under a lower overall bpp, thereby enhancing the expressive capability and improving reconstruction performance. Owing to its generalizable design, our method can be integrated into any existing codebook-based representation learning approach and has demonstrated its effectiveness on face recognition datasets, achieving an average accuracy of 93.51% for reconstructed images at 0.05 bpp.

CVSep 30, 2025
PFDepth: Heterogeneous Pinhole-Fisheye Joint Depth Estimation via Distortion-aware Gaussian-Splatted Volumetric Fusion

Zhiwei Zhang, Ruikai Xu, Weijian Zhang et al.

In this paper, we present the first pinhole-fisheye framework for heterogeneous multi-view depth estimation, PFDepth. Our key insight is to exploit the complementary characteristics of pinhole and fisheye imagery (undistorted vs. distorted, small vs. large FOV, far vs. near field) for joint optimization. PFDepth employs a unified architecture capable of processing arbitrary combinations of pinhole and fisheye cameras with varied intrinsics and extrinsics. Within PFDepth, we first explicitly lift 2D features from each heterogeneous view into a canonical 3D volumetric space. Then, a core module termed Heterogeneous Spatial Fusion is designed to process and fuse distortion-aware volumetric features across overlapping and non-overlapping regions. Additionally, we subtly reformulate the conventional voxel fusion into a novel 3D Gaussian representation, in which learnable latent Gaussian spheres dynamically adapt to local image textures for finer 3D aggregation. Finally, fused volume features are rendered into multi-view depth maps. Through extensive experiments, we demonstrate that PFDepth sets a state-of-the-art performance on KITTI-360 and RealHet datasets over current mainstream depth networks. To the best of our knowledge, this is the first systematic study of heterogeneous pinhole-fisheye depth estimation, offering both technical novelty and valuable empirical insights.

LGAug 2, 2025
T2S: Tokenized Skill Scaling for Lifelong Imitation Learning

Hongquan Zhang, Jingyu Gong, Zhizhong Zhang et al.

The main challenge in lifelong imitation learning lies in the balance between mitigating catastrophic forgetting of previous skills while maintaining sufficient capacity for acquiring new ones. However, current approaches typically address these aspects in isolation, overlooking their internal correlation in lifelong skill acquisition. We address this limitation with a unified framework named Tokenized Skill Scaling (T2S). Specifically, by tokenizing the model parameters, the linear parameter mapping of the traditional transformer is transformed into cross-attention between input and learnable tokens, thereby enhancing model scalability through the easy extension of new tokens. Additionally, we introduce language-guided skill scaling to transfer knowledge across tasks efficiently and avoid linearly growing parameters. Extensive experiments across diverse tasks demonstrate that T2S: 1) effectively prevents catastrophic forgetting (achieving an average NBT of 1.0% across the three LIBERO task suites), 2) excels in new skill scaling with minimal increases in trainable parameters (needing only 8.0% trainable tokens in an average of lifelong tasks), and 3) enables efficient knowledge transfer between tasks (achieving an average FWT of 77.7% across the three LIBERO task suites), offering a promising solution for lifelong imitation learning.

CVJul 11, 2025
From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning

Sen Wang, Shao Zeng, Tianjun Gu et al.

Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.

ROMay 28, 2025
DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation

Tianjun Gu, Linfeng Li, Xuhong Wang et al.

Adaptive navigation in unfamiliar environments is crucial for household service robots but remains challenging due to the need for both low-level path planning and high-level scene understanding. While recent vision-language model (VLM) based zero-shot approaches reduce dependence on prior maps and scene-specific training data, they face significant limitations: spatiotemporal discontinuity from discrete observations, unstructured memory representations, and insufficient task understanding leading to navigation failures. We propose DORAEMON (Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation), a novel cognitive-inspired framework consisting of Ventral and Dorsal Streams that mimics human navigation capabilities. The Dorsal Stream implements the Hierarchical Semantic-Spatial Fusion and Topology Map to handle spatiotemporal discontinuities, while the Ventral Stream combines RAG-VLM and Policy-VLM to improve decision-making. Our approach also develops Nav-Ensurance to ensure navigation safety and efficiency. We evaluate DORAEMON on the HM3D, MP3D, and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL) metrics, significantly outperforming existing methods. We also introduce a new evaluation metric (AORI) to assess navigation intelligence better. Comprehensive experiments demonstrate DORAEMON's effectiveness in zero-shot autonomous navigation without requiring prior map building or pre-training.

LGMay 22, 2024
Gradient Projection For Continual Parameter-Efficient Tuning

Jingyang Qiao, Zhizhong Zhang, Xin Tan et al.

Parameter-efficient tunings (PETs) have demonstrated impressive performance and promising perspectives in training large models, while they are still confronted with a common problem: the trade-off between learning new content and protecting old knowledge, leading to zero-shot generalization collapse, and cross-modal hallucination. In this paper, we reformulate Adapter, LoRA, Prefix-tuning, and Prompt-tuning from the perspective of gradient projection, and firstly propose a unified framework called Parameter Efficient Gradient Projection (PEGP). We introduce orthogonal gradient projection into different PET paradigms and theoretically demonstrate that the orthogonal condition for the gradient can effectively resist forgetting even for large-scale models. It therefore modifies the gradient towards the direction that has less impact on the old feature space, with less extra memory space and training time. We extensively evaluate our method with different backbones, including ViT and CLIP, on diverse datasets, and experiments comprehensively demonstrate its efficiency in reducing forgetting in class, online class, domain, task, and multi-modality continual settings. The project page is available at https://dmcv-ecnu-pegp.github.io/.

CVJun 4, 2024
FastLGS: Speeding up Language Embedded Gaussians with Feature Grid Mapping

Yuzhou Ji, He Zhu, Junshu Tang et al.

The semantically interactive radiance field has always been an appealing task for its potential to facilitate user-friendly and automated real-world 3D scene understanding applications. However, it is a challenging task to achieve high quality, efficiency and zero-shot ability at the same time with semantics in radiance fields. In this work, we present FastLGS, an approach that supports real-time open-vocabulary query within 3D Gaussian Splatting (3DGS) under high resolution. We propose the semantic feature grid to save multi-view CLIP features which are extracted based on Segment Anything Model (SAM) masks, and map the grids to low dimensional features for semantic field training through 3DGS. Once trained, we can restore pixel-aligned CLIP embeddings through feature grids from rendered features for open-vocabulary queries. Comparisons with other state-of-the-art methods prove that FastLGS can achieve the first place performance concerning both speed and accuracy, where FastLGS is 98x faster than LERF and 4x faster than LangSplat. Meanwhile, experiments show that FastLGS is adaptive and compatible with many downstream tasks, such as 3D segmentation and 3D object inpainting, which can be easily applied to other 3D manipulation systems.

CVMar 4, 2024
DEMOS: Dynamic Environment Motion Synthesis in 3D Scenes via Local Spherical-BEV Perception

Jingyu Gong, Min Wang, Wentao Liu et al.

Motion synthesis in real-world 3D scenes has recently attracted much attention. However, the static environment assumption made by most current methods usually cannot be satisfied especially for real-time motion synthesis in scanned point cloud scenes, if multiple dynamic objects exist, e.g., moving persons or vehicles. To handle this problem, we propose the first Dynamic Environment MOtion Synthesis framework (DEMOS) to predict future motion instantly according to the current scene, and use it to dynamically update the latent motion for final motion synthesis. Concretely, we propose a Spherical-BEV perception method to extract local scene features that are specifically designed for instant scene-aware motion prediction. Then, we design a time-variant motion blending to fuse the new predicted motions into the latent motion, and the final motion is derived from the updated latent motions, benefitting both from motion-prior and iterative methods. We unify the data format of two prevailing datasets, PROX and GTA-IM, and take them for motion synthesis evaluation in 3D scenes. We also assess the effectiveness of the proposed method in dynamic environments from GTA-IM and Semantic3D to check the responsiveness. The results show our method outperforms previous works significantly and has great performance in handling dynamic environments.

CVMay 3, 2023
Cross-Stream Contrastive Learning for Self-Supervised Skeleton-Based Action Recognition

Ding Li, Yongqiang Tang, Zhizhong Zhang et al.

Self-supervised skeleton-based action recognition enjoys a rapid growth along with the development of contrastive learning. The existing methods rely on imposing invariance to augmentations of 3D skeleton within a single data stream, which merely leverages the easy positive pairs and limits the ability to explore the complicated movement patterns. In this paper, we advocate that the defect of single-stream contrast and the lack of necessary feature transformation are responsible for easy positives, and therefore propose a Cross-Stream Contrastive Learning framework for skeleton-based action Representation learning (CSCLR). Specifically, the proposed CSCLR not only utilizes intra-stream contrast pairs, but introduces inter-stream contrast pairs as hard samples to formulate a better representation learning. Besides, to further exploit the potential of positive pairs and increase the robustness of self-supervised representation learning, we propose a Positive Feature Transformation (PFT) strategy which adopts feature-level manipulation to increase the variance of positive pairs. To validate the effectiveness of our method, we conduct extensive experiments on three benchmark datasets NTU-RGB+D 60, NTU-RGB+D 120 and PKU-MMD. Experimental results show that our proposed CSCLR exceeds the state-of-the-art methods on a diverse range of evaluation protocols.

CVNov 4, 2021
Multi-scale 2D Representation Learning for weakly-supervised moment retrieval

Ding Li, Rui Wu, Yongqiang Tang et al.

Video moment retrieval aims to search the moment most relevant to a given language query. However, most existing methods in this community often require temporal boundary annotations which are expensive and time-consuming to label. Hence weakly supervised methods have been put forward recently by only using coarse video-level label. Despite effectiveness, these methods usually process moment candidates independently, while ignoring a critical issue that the natural temporal dependencies between candidates in different temporal scales. To cope with this issue, we propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval. Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates. Two dimensions in this map indicate the start and end time points of these candidates. Then, we select top-K candidates from each scale-varied map with a learnable convolutional neural network. With a newly designed Moments Evaluation Module, we obtain the alignment scores of the selected candidates. At last, the similarity between captions and language query is served as supervision for further training the candidates' selector. Experiments on two benchmark datasets Charades-STA and ActivityNet Captions demonstrate that our approach achieves superior performance to state-of-the-art results.

CVApr 7, 2021
Farewell to Mutual Information: Variational Distillation for Cross-Modal Person Re-Identification

Xudong Tian, Zhizhong Zhang, Shaohui Lin et al.

The Information Bottleneck (IB) provides an information theoretic principle for representation learning, by retaining all information relevant for predicting label while minimizing the redundancy. Though IB principle has been applied to a wide range of applications, its optimization remains a challenging problem which heavily relies on the accurate estimation of mutual information. In this paper, we present a new strategy, Variational Self-Distillation (VSD), which provides a scalable, flexible and analytic solution to essentially fitting the mutual information but without explicitly estimating it. Under rigorously theoretical guarantee, VSD enables the IB to grasp the intrinsic correlation between representation and label for supervised training. Furthermore, by extending VSD to multi-view learning, we introduce two other strategies, Variational Cross-Distillation (VCD) and Variational Mutual-Learning (VML), which significantly improve the robustness of representation to view-changes by eliminating view-specific and task-irrelevant information. To verify our theoretically grounded strategies, we apply our approaches to cross-modal person Re-ID, and conduct extensive experiments, where the superior performance against state-of-the-art methods are demonstrated. Our intriguing findings highlight the need to rethink the way to estimate mutual

CVSep 27, 2017
Effective Image Retrieval via Multilinear Multi-index Fusion

Zhizhong Zhang, Yuan Xie, Wensheng Zhang et al.

Multi-index fusion has demonstrated impressive performances in retrieval task by integrating different visual representations in a unified framework. However, previous works mainly consider propagating similarities via neighbor structure, ignoring the high order information among different visual representations. In this paper, we propose a new multi-index fusion scheme for image retrieval. By formulating this procedure as a multilinear based optimization problem, the complementary information hidden in different indexes can be explored more thoroughly. Specially, we first build our multiple indexes from various visual representations. Then a so-called index-specific functional matrix, which aims to propagate similarities, is introduced for updating the original index. The functional matrices are then optimized in a unified tensor space to achieve a refinement, such that the relevant images can be pushed more closer. The optimization problem can be efficiently solved by the augmented Lagrangian method with theoretical convergence guarantee. Unlike the traditional multi-index fusion scheme, our approach embeds the multi-index subspace structure into the new indexes with sparse constraint, thus it has little additional memory consumption in online query stage. Experimental evaluation on three benchmark datasets reveals that the proposed approach achieves the state-of-the-art performance, i.e., N-score 3.94 on UKBench, mAP 94.1\% on Holiday and 62.39\% on Market-1501.