Junni Zou

CV
h-index27
35papers
809citations
Novelty58%
AI Score61

35 Papers

CVJul 13, 2024Code
Image Compression for Machine and Human Vision with Spatial-Frequency Adaptation

Han Li, Shaohui Li, Shuangrui Ding et al.

Image compression for machine and human vision (ICMH) has gained increasing attention in recent years. Existing ICMH methods are limited by high training and storage overheads due to heavy design of task-specific networks. To address this issue, in this paper, we develop a novel lightweight adapter-based tuning framework for ICMH, named Adapt-ICMH, that better balances task performance and bitrates with reduced overheads. We propose a spatial-frequency modulation adapter (SFMA) that simultaneously eliminates non-semantic redundancy with a spatial modulation adapter, and enhances task-relevant frequency components and suppresses task-irrelevant frequency components with a frequency modulation adapter. The proposed adapter is plug-and-play and compatible with almost all existing learned image compression models without compromising the performance of pre-trained models. Experiments demonstrate that Adapt-ICMH consistently outperforms existing ICMH frameworks on various machine vision tasks with fewer fine-tuned parameters and reduced computational complexity. Code will be released at https://github.com/qingshi9974/ECCV2024-AdpatICMH .

IVOct 25, 2023
Frequency-Aware Transformer for Learned Image Compression

Han Li, Shaohui Li, Wenrui Dai et al.

Learned image compression (LIC) has gained traction as an effective solution for image storage and transmission in recent years. However, existing LIC methods are redundant in latent representation due to limitations in capturing anisotropic frequency components and preserving directional details. To overcome these challenges, we propose a novel frequency-aware transformer (FAT) block that for the first time achieves multiscale directional ananlysis for LIC. The FAT block comprises frequency-decomposition window attention (FDWA) modules to capture multiscale and directional frequency components of natural images. Additionally, we introduce frequency-modulation feed-forward network (FMFFN) to adaptively modulate different frequency components, improving rate-distortion performance. Furthermore, we present a transformer-based channel-wise autoregressive (T-CA) model that effectively exploits channel dependencies. Experiments show that our method achieves state-of-the-art rate-distortion performance compared to existing LIC methods, and evidently outperforms latest standardized codec VTM-12.1 by 14.5%, 15.1%, 13.0% in BD-rate on the Kodak, Tecnick, and CLIC datasets.

CVFeb 15, 2023
Pose-Oriented Transformer with Uncertainty-Guided Refinement for 2D-to-3D Human Pose Estimation

Han Li, Bowen Shi, Wenrui Dai et al.

There has been a recent surge of interest in introducing transformers to 3D human pose estimation (HPE) due to their powerful capabilities in modeling long-term dependencies. However, existing transformer-based methods treat body joints as equally important inputs and ignore the prior knowledge of human skeleton topology in the self-attention mechanism. To tackle this issue, in this paper, we propose a Pose-Oriented Transformer (POT) with uncertainty guided refinement for 3D HPE. Specifically, we first develop novel pose-oriented self-attention mechanism and distance-related position embedding for POT to explicitly exploit the human skeleton topology. The pose-oriented self-attention mechanism explicitly models the topological interactions between body joints, whereas the distance-related position embedding encodes the distance of joints to the root joint to distinguish groups of joints with different difficulties in regression. Furthermore, we present an Uncertainty-Guided Refinement Network (UGRN) to refine pose predictions from POT, especially for the difficult joints, by considering the estimated uncertainty of each joint with uncertainty-guided sampling strategy and self-attention mechanism. Extensive experiments demonstrate that our method significantly outperforms the state-of-the-art methods with reduced model parameters on 3D HPE benchmarks such as Human3.6M and MPI-INF-3DHP

CVJan 21Code
Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers

Xinyu Peng, Han Li, Yuyang Huang et al.

Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named \textbf{L}ocal \textbf{D}iffusion \textbf{F}orcing for \textbf{V}ideo \textbf{F}rame \textbf{I}nterpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. To mitigate error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates sparse, local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also allows generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining. An enhanced conditional VAE decoder, which leverages multi-scale features from the input video, further improves reconstruction fidelity. Empirically, LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency, especially in scenes with large motion. The source code is available at https://github.com/xypeng9903/LDF-VFI.

CVApr 25, 2022
Hybrid ISTA: Unfolding ISTA With Convergence Guarantees Using Free-Form Deep Neural Networks

Ziyang Zheng, Wenrui Dai, Duoduo Xue et al.

It is promising to solve linear inverse problems by unfolding iterative algorithms (e.g., iterative shrinkage thresholding algorithm (ISTA)) as deep neural networks (DNNs) with learnable parameters. However, existing ISTA-based unfolded algorithms restrict the network architectures for iterative updates with the partial weight coupling structure to guarantee convergence. In this paper, we propose hybrid ISTA to unfold ISTA with both pre-computed and learned parameters by incorporating free-form DNNs (i.e., DNNs with arbitrary feasible and reasonable network architectures), while ensuring theoretical convergence. We first develop HCISTA to improve the efficiency and flexibility of classical ISTA (with pre-computed parameters) without compromising the convergence rate in theory. Furthermore, the DNN-based hybrid algorithm is generalized to popular variants of learned ISTA, dubbed HLISTA, to enable a free architecture of learned parameters with a guarantee of linear convergence. To our best knowledge, this paper is the first to provide a convergence-provable framework that enables free-form DNNs in ISTA-based unfolded algorithms. This framework is general to endow arbitrary DNNs for solving linear inverse problems with convergence guarantees. Extensive experiments demonstrate that hybrid ISTA can reduce the reconstruction error with an improved convergence rate in the tasks of sparse recovery and compressive sensing.

CVMay 31, 2022
Hierarchical Spherical CNNs with Lifting-based Adaptive Wavelets for Pooling and Unpooling

Mingxing Xu, Chenglin Li, Wenrui Dai et al.

Pooling and unpooling are two essential operations in constructing hierarchical spherical convolutional neural networks (HS-CNNs) for comprehensive feature learning in the spherical domain. Most existing models employ downsampling-based pooling, which will inevitably incur information loss and cannot adapt to different spherical signals and tasks. Besides, the preserved information after pooling cannot be well restored by the subsequent unpooling to characterize the desirable features for a task. In this paper, we propose a novel framework of HS-CNNs with a lifting structure to learn adaptive spherical wavelets for pooling and unpooling, dubbed LiftHS-CNN, which ensures a more efficient hierarchical feature learning for both image- and pixel-level tasks. Specifically, adaptive spherical wavelets are learned with a lifting structure that consists of trainable lifting operators (i.e., update and predict operators). With this learnable lifting structure, we can adaptively partition a signal into two sub-bands containing low- and high-frequency components, respectively, and thus generate a better down-scaled representation for pooling by preserving more information in the low-frequency sub-band. The update and predict operators are parameterized with graph-based attention to jointly consider the signal's characteristics and the underlying geometries. We further show that particular properties are promised by the learned wavelets, ensuring the spatial-frequency localization for better exploiting the signal's correlation in both spatial and frequency domains. We then propose an unpooling operation that is invertible to the lifting-based pooling, where an inverse wavelet transform is performed by using the learned lifting operators to restore an up-scaled representation. Extensive empirical evaluations on various spherical domain tasks validate the superiority of the proposed LiftHS-CNN.

CVJun 28, 2023
Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners

Bowen Shi, Xiaopeng Zhang, Yaoming Wang et al.

Representation learning has been evolving from traditional supervised training to Contrastive Learning (CL) and Masked Image Modeling (MIM). Previous works have demonstrated their pros and cons in specific scenarios, i.e., CL and supervised pre-training excel at capturing longer-range global patterns and enabling better feature discrimination, while MIM can introduce more local and diverse attention across all transformer layers. In this paper, we explore how to obtain a model that combines their strengths. We start by examining previous feature distillation and mask feature reconstruction methods and identify their limitations. We find that their increasing diversity mainly derives from the asymmetric designs, but these designs may in turn compromise the discrimination ability. In order to better obtain both discrimination and diversity, we propose a simple but effective Hybrid Distillation strategy, which utilizes both the supervised/CL teacher and the MIM teacher to jointly guide the student model. Hybrid Distill imitates the token relations of the MIM teacher to alleviate attention collapse, as well as distills the feature maps of the supervised/CL teacher to enable discrimination. Furthermore, a progressive redundant token masking strategy is also utilized to reduce the distilling costs and avoid falling into local optima. Experiment results prove that Hybrid Distill can achieve superior performance on different benchmarks.

LGApr 27, 2022
LiftPool: Lifting-based Graph Pooling for Hierarchical Graph Representation Learning

Mingxing Xu, Wenrui Dai, Chenglin Li et al.

Graph pooling has been increasingly considered for graph neural networks (GNNs) to facilitate hierarchical graph representation learning. Existing graph pooling methods commonly consist of two stages, i.e., selecting the top-ranked nodes and removing the rest nodes to construct a coarsened graph representation. However, local structural information of the removed nodes would be inevitably dropped in these methods, due to the inherent coupling of nodes (location) and their features (signals). In this paper, we propose an enhanced three-stage method via lifting, named LiftPool, to improve hierarchical graph representation by maximally preserving the local structural information in graph pooling. LiftPool introduces an additional stage of graph lifting before graph coarsening to preserve the local information of the removed nodes and decouple the processes of node removing and feature reduction. Specifically, for each node to be removed, its local information is obtained by subtracting the global information aggregated from its neighboring preserved nodes. Subsequently, this local information is aligned and propagated to the preserved nodes to alleviate information loss in graph coarsening. Furthermore, we demonstrate that the proposed LiftPool is localized and permutation-invariant. The proposed graph lifting structure is general to be integrated with existing downsampling-based graph pooling methods. Evaluations on benchmark graph datasets show that LiftPool substantially outperforms the state-of-the-art graph pooling methods in the task of graph classification.

ITJun 6, 2022
Optimization-based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning

Qi Wang, Ying Cui, Chenglin Li et al.

Gradient coding schemes effectively mitigate full stragglers in distributed learning by introducing identical redundancy in coded local partial derivatives corresponding to all model parameters. However, they are no longer effective for partial stragglers as they cannot utilize incomplete computation results from partial stragglers. This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning. Specifically, we consider a distributed system consisting of one master and N workers, characterized by a general partial straggler model and focuses on solving a general large-scale machine learning problem with L model parameters using gradient coding. First, we propose a coordinate gradient coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which generates most gradient coding schemes. Then, we consider the minimization of the expected overall runtime and the maximization of the completion probability with respect to the L coding parameters for coordinates, which are challenging discrete optimization problems. To reduce computational complexity, we first transform each to an equivalent but much simpler discrete problem with N\llL variables representing the partition of the L coordinates into N blocks, each with identical redundancy. This indicates an equivalent but more easily implemented block coordinate gradient coding scheme with N coding parameters for blocks. Then, we adopt continuous relaxation to further reduce computational complexity. For the resulting minimization of expected overall runtime, we develop an iterative algorithm of computational complexity O(N^2) to obtain an optimal solution and derive two closed-form approximate solutions both with computational complexity O(N). For the resultant maximization of the completion probability, we develop an iterative algorithm of...

IVMar 5, 2023
Learned Lossless Compression for JPEG via Frequency-Domain Prediction

Jixiang Luo, Shaohui Li, Wenrui Dai et al.

JPEG images can be further compressed to enhance the storage and transmission of large-scale image datasets. Existing learned lossless compressors for RGB images cannot be well transferred to JPEG images due to the distinguishing distribution of DCT coefficients and raw pixels. In this paper, we propose a novel framework for learned lossless compression of JPEG images that achieves end-to-end optimized prediction of the distribution of decoded DCT coefficients. To enable learning in the frequency domain, DCT coefficients are partitioned into groups to utilize implicit local redundancy. An autoencoder-like architecture is designed based on the weight-shared blocks to realize entropy modeling of grouped DCT coefficients and independently compress the priors. We attempt to realize learned lossless compression of JPEG images in the frequency domain. Experimental results demonstrate that the proposed framework achieves superior or comparable performance in comparison to most recent lossless compressors with handcrafted context modeling for JPEG images.

LGApr 14
Information-Theoretic Optimization for Task-Adapted Compressed Sensing Magnetic Resonance Imaging

Xinyu Peng, Ziyang Zheng, Wenrui Dai et al.

Task-adapted compressed sensing magnetic resonance imaging (CS-MRI) is emerging to address the specific demands of downstream clinical tasks with significantly fewer k-space measurements than required by Nyquist sampling. However, existing task-adapted CS-MRI methods suffer from the uncertainty problem for medical diagnosis and cannot achieve adaptive sampling in end-to-end optimization with reconstruction or clinical tasks. To address these limitations, we propose the first task-adapted CS-MRI from the information-theoretic perspective to simultaneously achieve probabilistic inference for uncertainty prediction and adapt to arbitrary sampling ratios and versatile clinical applications. Specifically, we formalize the task-adapted CS-MRI optimization problem by maximizing the mutual information between undersampled k-space measurements and clinical tasks to enable probabilistic inference for addressing the uncertainty problem. We leverage amortized optimization and construct tractable variational bounds for mutual information to jointly optimize sampling, reconstruction, and task-inference models, which enables flexible sampling ratio control using a single end-to-end trained model. Furthermore, the proposed framework addresses two kinds of distinct clinical scenarios within a unified approach, i.e., i) joint task and reconstruction, where reconstruction serves as an auxiliary process to enhance task performance; and ii) task implementation with suppressed reconstruction, applicable for privacy protection. Extensive experiments on large-scale MRI datasets demonstrate that the proposed framework achieves highly competitive performance on standard metrics like Dice compared to deterministic counterpart but provides better distribution matching to the ground-truth posterior distribution as measured by the generalized energy distance (GED).

CVDec 11, 2025
Error-Propagation-Free Learned Video Compression With Dual-Domain Progressive Temporal Alignment

Han Li, Shaohui Li, Wenrui Dai et al.

Existing frameworks for learned video compression suffer from a dilemma between inaccurate temporal alignment and error propagation for motion estimation and compensation (ME/MC). The separate-transform framework employs distinct transforms for intra-frame and inter-frame compression to yield impressive rate-distortion (R-D) performance but causes evident error propagation, while the unified-transform framework eliminates error propagation via shared transforms but is inferior in ME/MC in shared latent domains. To address this limitation, in this paper, we propose a novel unifiedtransform framework with dual-domain progressive temporal alignment and quality-conditioned mixture-of-expert (QCMoE) to enable quality-consistent and error-propagation-free streaming for learned video compression. Specifically, we propose dualdomain progressive temporal alignment for ME/MC that leverages coarse pixel-domain alignment and refined latent-domain alignment to significantly enhance temporal context modeling in a coarse-to-fine fashion. The coarse pixel-domain alignment efficiently handles simple motion patterns with optical flow estimated from a single reference frame, while the refined latent-domain alignment develops a Flow-Guided Deformable Transformer (FGDT) over latents from multiple reference frames to achieve long-term motion refinement (LTMR) for complex motion patterns. Furthermore, we design a QCMoE module for continuous bit-rate adaptation that dynamically assigns different experts to adjust quantization steps per pixel based on target quality and content rather than relies on a single quantization step. QCMoE allows continuous and consistent rate control with appealing R-D performance. Experimental results show that the proposed method achieves competitive R-D performance compared with the state-of-the-arts, while successfully eliminating error propagation.

CVJan 12, 2024Code
UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

Bowen Shi, Peisen Zhao, Zichen Wang et al.

Vision-language foundation models, represented by Contrastive Language-Image Pre-training (CLIP), have gained increasing attention for jointly understanding both vision and textual tasks. However, existing approaches primarily focus on training models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local regions and corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Notably, we deliberately construct a new dataset comprising pseudo annotations at various levels of granularities, encompassing image-level, region-level as well as pixel-level captions and tags. Accordingly, we develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities across different levels of detail. With parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks. We believe that UMG-CLIP represents a valuable advancement in vision-language foundation models. The code is available at https://github.com/lygsbw/UMG-CLIP.

CVNov 23, 2021Code
Hierarchical Graph Networks for 3D Human Pose Estimation

Han Li, Bowen Shi, Wenrui Dai et al.

Recent 2D-to-3D human pose estimation works tend to utilize the graph structure formed by the topology of the human skeleton. However, we argue that this skeletal topology is too sparse to reflect the body structure and suffer from serious 2D-to-3D ambiguity problem. To overcome these weaknesses, we propose a novel graph convolution network architecture, Hierarchical Graph Networks (HGN). It is based on denser graph topology generated by our multi-scale graph structure building strategy, thus providing more delicate geometric information. The proposed architecture contains three sparse-to-fine representation subnetworks organized in parallel, in which multi-scale graph-structured features are processed and exchange information through a novel feature fusion strategy, leading to rich hierarchical representations. We also introduce a 3D coarse mesh constraint to further boost detail-related feature learning. Extensive experiments demonstrate that our HGN achieves the state-of-the art performance with reduced network parameters. Code is released at https://github.com/qingshi9974/BMVC2021-Hierarchical-Graph-Networks-for-3D-Human-Pose-Estimation.

LGOct 19, 2020Code
MimicNorm: Weight Mean and Last BN Layer Mimic the Dynamic of Batch Normalization

Wen Fei, Wenrui Dai, Chenglin Li et al.

Substantial experiments have validated the success of Batch Normalization (BN) Layer in benefiting convergence and generalization. However, BN requires extra memory and float-point calculation. Moreover, BN would be inaccurate on micro-batch, as it depends on batch statistics. In this paper, we address these problems by simplifying BN regularization while keeping two fundamental impacts of BN layers, i.e., data decorrelation and adaptive learning rate. We propose a novel normalization method, named MimicNorm, to improve the convergence and efficiency in network training. MimicNorm consists of only two light operations, including modified weight mean operations (subtract mean values from weight parameter tensor) and one BN layer before loss function (last BN layer). We leverage the neural tangent kernel (NTK) theory to prove that our weight mean operation whitens activations and transits network into the chaotic regime like BN layer, and consequently, leads to an enhanced convergence. The last BN layer provides autotuned learning rates and also improves accuracy. Experimental results show that MimicNorm achieves similar accuracy for various network structures, including ResNets and lightweight networks like ShuffleNet, with a reduction of about 20% memory consumption. The code is publicly available at https://github.com/Kid-key/MimicNorm.

CVAug 17, 2020Code
AP-Loss for Accurate One-Stage Object Detection

Kean Chen, Weiyao Lin, Jianguo Li et al.

One-stage object detectors are trained by optimizing classification-loss and localization-loss simultaneously, with the former suffering much from extreme foreground-background class imbalance issue due to the large number of anchors. This paper alleviates this issue by proposing a novel framework to replace the classification task in one-stage detectors with a ranking task, and adopting the Average-Precision loss (AP-loss) for the ranking problem. Due to its non-differentiability and non-convexity, the AP-loss cannot be optimized directly. For this purpose, we develop a novel optimization algorithm, which seamlessly combines the error-driven update scheme in perceptron learning and backpropagation algorithm in deep networks. We provide in-depth analyses on the good convergence property and computational complexity of the proposed algorithm, both theoretically and empirically. Experimental results demonstrate notable improvement in addressing the imbalance issue in object detection over existing AP-based optimization algorithms. An improved state-of-the-art performance is achieved in one-stage detectors based on AP-loss over detectors using classification-losses on various standard benchmarks. The proposed framework is also highly versatile in accommodating different network architectures. Code is available at https://github.com/cccorn/AP-loss .

CVApr 12, 2019Code
Towards Accurate One-Stage Object Detection with AP-Loss

Kean Chen, Jianguo Li, Weiyao Lin et al.

One-stage object detectors are trained by optimizing classification-loss and localization-loss simultaneously, with the former suffering much from extreme foreground-background class imbalance issue due to the large number of anchors. This paper alleviates this issue by proposing a novel framework to replace the classification task in one-stage detectors with a ranking task, and adopting the Average-Precision loss (AP-loss) for the ranking problem. Due to its non-differentiability and non-convexity, the AP-loss cannot be optimized directly. For this purpose, we develop a novel optimization algorithm, which seamlessly combines the error-driven update scheme in perceptron learning and backpropagation algorithm in deep networks. We verify good convergence property of the proposed algorithm theoretically and empirically. Experimental results demonstrate notable performance improvement in state-of-the-art one-stage detectors based on AP-loss over different kinds of classification-losses on various benchmarks, without changing the network architectures. Code is available at https://github.com/cccorn/AP-loss.

CVFeb 3, 2024
Improving Diffusion Models for Inverse Problems Using Optimal Posterior Covariance

Xinyu Peng, Ziyang Zheng, Wenrui Dai et al.

Recent diffusion models provide a promising zero-shot solution to noisy linear inverse problems without retraining for specific inverse problems. In this paper, we reveal that recent methods can be uniformly interpreted as employing a Gaussian approximation with hand-crafted isotropic covariance for the intractable denoising posterior to approximate the conditional posterior mean. Inspired by this finding, we propose to improve recent methods by using more principled covariance determined by maximum likelihood estimation. To achieve posterior covariance optimization without retraining, we provide general plug-and-play solutions based on two approaches specifically designed for leveraging pre-trained models with and without reverse covariance. We further propose a scalable method for learning posterior covariance prediction based on representation with orthonormal basis. Experimental results demonstrate that the proposed methods significantly enhance reconstruction performance without requiring hyperparameter tuning.

CVMar 30, 2025
HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation

Hongwei Zheng, Han Li, Wenrui Dai et al.

Existing 2D-to-3D human pose estimation (HPE) methods struggle with the occlusion issue by enriching information like temporal and visual cues in the lifting stage. In this paper, we argue that these methods ignore the limitation of the sparse skeleton 2D input representation, which fundamentally restricts the 2D-to-3D lifting and worsens the occlusion issue. To address these, we propose a novel two-stage generative densification method, named Hierarchical Pose AutoRegressive Transformer (HiPART), to generate hierarchical 2D dense poses from the original sparse 2D pose. Specifically, we first develop a multi-scale skeleton tokenization module to quantize the highly dense 2D pose into hierarchical tokens and propose a Skeleton-aware Alignment to strengthen token connections. We then develop a Hierarchical AutoRegressive Modeling scheme for hierarchical 2D pose generation. With generated hierarchical poses as inputs for 2D-to-3D lifting, the proposed method shows strong robustness in occluded scenarios and achieves state-of-the-art performance on the single-frame-based 3D HPE. Moreover, it outperforms numerous multi-frame methods while reducing parameter and computational complexity and can also complement them to further enhance performance and robustness.

IVJan 23, 2025
On Disentangled Training for Nonlinear Transform in Learned Image Compression

Han Li, Shaohui Li, Wenrui Dai et al.

Learned image compression (LIC) has demonstrated superior rate-distortion (R-D) performance compared to traditional codecs, but is challenged by training inefficiency that could incur more than two weeks to train a state-of-the-art model from scratch. Existing LIC methods overlook the slow convergence caused by compacting energy in learning nonlinear transforms. In this paper, we first reveal that such energy compaction consists of two components, i.e., feature decorrelation and uneven energy modulation. On such basis, we propose a linear auxiliary transform (AuxT) to disentangle energy compaction in training nonlinear transforms. The proposed AuxT obtains coarse approximation to achieve efficient energy compaction such that distribution fitting with the nonlinear transforms can be simplified to fine details. We then develop wavelet-based linear shortcuts (WLSs) for AuxT that leverages wavelet-based downsampling and orthogonal linear projection for feature decorrelation and subband-aware scaling for

CVNov 21, 2024
Point Cloud Resampling with Learnable Heat Diffusion

Wenqiang Xu, Wenrui Dai, Duoduo Xue et al.

Generative diffusion models have shown empirical successes in point cloud resampling, generating a denser and more uniform distribution of points from sparse or noisy 3D point clouds by progressively refining noise into structure. However, existing diffusion models employ manually predefined schemes, which often fail to recover the underlying point cloud structure due to the rigid and disruptive nature of the geometric degradation. To address this issue, we propose a novel learnable heat diffusion framework for point cloud resampling, which directly parameterizes the marginal distribution for the forward process by learning the adaptive heat diffusion schedules and local filtering scales of the time-varying heat kernel, and consequently, generates an adaptive conditional prior for the reverse process. Unlike previous diffusion models with a fixed prior, the adaptive conditional prior selectively preserves geometric features of the point cloud by minimizing a refined variational lower bound, guiding the points to evolve towards the underlying surface during the reverse process. Extensive experimental results demonstrate that the proposed point cloud resampling achieves state-of-the-art performance in representative reconstruction tasks including point cloud denoising and upsampling.

LGDec 16, 2023
scBiGNN: Bilevel Graph Representation Learning for Cell Type Classification from Single-cell RNA Sequencing Data

Rui Yang, Wenrui Dai, Chenglin Li et al.

Single-cell RNA sequencing (scRNA-seq) technology provides high-throughput gene expression data to study the cellular heterogeneity and dynamics of complex organisms. Graph neural networks (GNNs) have been widely used for automatic cell type classification, which is a fundamental problem to solve in scRNA-seq analysis. However, existing methods do not sufficiently exploit both gene-gene and cell-cell relationships, and thus the true potential of GNNs is not realized. In this work, we propose a bilevel graph representation learning method, named scBiGNN, to simultaneously mine the relationships at both gene and cell levels for more accurate single-cell classification. Specifically, scBiGNN comprises two GNN modules to identify cell types. A gene-level GNN is established to adaptively learn gene-gene interactions and cell representations via the self-attention mechanism, and a cell-level GNN builds on the cell-cell graph that is constructed from the cell representations generated by the gene-level GNN. To tackle the scalability issue for processing a large number of cells, scBiGNN adopts an Expectation Maximization (EM) framework in which the two modules are alternately trained via the E-step and M-step to learn from each other. Through this interaction, the gene- and cell-level structural information is integrated to gradually enhance the classification performance of both GNN modules. Experiments on benchmark datasets demonstrate that our scBiGNN outperforms a variety of existing methods for cell type classification from scRNA-seq data.

CVAug 7, 2025
3DGabSplat: 3D Gabor Splatting for Frequency-adaptive Radiance Field Rendering

Junyu Zhou, Yuyang Huang, Wenrui Dai et al.

Recent prominence in 3D Gaussian Splatting (3DGS) has enabled real-time rendering while maintaining high-fidelity novel view synthesis. However, 3DGS resorts to the Gaussian function that is low-pass by nature and is restricted in representing high-frequency details in 3D scenes. Moreover, it causes redundant primitives with degraded training and rendering efficiency and excessive memory overhead. To overcome these limitations, we propose 3D Gabor Splatting (3DGabSplat) that leverages a novel 3D Gabor-based primitive with multiple directional 3D frequency responses for radiance field representation supervised by multi-view images. The proposed 3D Gabor-based primitive forms a filter bank incorporating multiple 3D Gabor kernels at different frequencies to enhance flexibility and efficiency in capturing fine 3D details. Furthermore, to achieve novel view rendering, an efficient CUDA-based rasterizer is developed to project the multiple directional 3D frequency components characterized by 3D Gabor-based primitives onto the 2D image plane, and a frequency-adaptive mechanism is presented for adaptive joint optimization of primitives. 3DGabSplat is scalable to be a plug-and-play kernel for seamless integration into existing 3DGS paradigms to enhance both efficiency and quality of novel view synthesis. Extensive experiments demonstrate that 3DGabSplat outperforms 3DGS and its variants using alternative primitives, and achieves state-of-the-art rendering quality across both real-world and synthetic scenes. Remarkably, we achieve up to 1.35 dB PSNR gain over 3DGS with simultaneously reduced number of primitives and memory consumption.

CVJun 11, 2025
Noise Conditional Variational Score Distillation

Xinyu Peng, Ziyang Zheng, Yaoming Wang et al.

We propose Noise Conditional Variational Score Distillation (NCVSD), a novel method for distilling pretrained diffusion models into generative denoisers. We achieve this by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions. By integrating this insight into the Variational Score Distillation (VSD) framework, we enable scalable learning of generative denoisers capable of approximating samples from the denoising posterior distribution across a wide range of noise levels. The proposed generative denoisers exhibit desirable properties that allow fast generation while preserve the benefit of iterative refinement: (1) fast one-step generation through sampling from pure Gaussian noise at high noise levels; (2) improved sample quality by scaling the test-time compute with multi-step sampling; and (3) zero-shot probabilistic inference for flexible and controllable sampling. We evaluate NCVSD through extensive experiments, including class-conditional image generation and inverse problem solving. By scaling the test-time compute, our method outperforms teacher diffusion models and is on par with consistency models of larger sizes. Additionally, with significantly fewer NFEs than diffusion-based methods, we achieve record-breaking LPIPS on inverse problems.

CVNov 21, 2024
Point Cloud Denoising With Fine-Granularity Dynamic Graph Convolutional Networks

Wenqiang Xu, Wenrui Dai, Duoduo Xue et al.

Due to limitations in acquisition equipment, noise perturbations often corrupt 3-D point clouds, hindering down-stream tasks such as surface reconstruction, rendering, and further processing. Existing 3-D point cloud denoising methods typically fail to reliably fit the underlying continuous surface, resulting in a degradation of reconstruction performance. This paper introduces fine-granularity dynamic graph convolutional networks called GD-GCN, a novel approach to denoising in 3-D point clouds. The GD-GCN employs micro-step temporal graph convolution (MST-GConv) to perform feature learning in a gradual manner. Compared with the conventional GCN, which commonly uses discrete integer-step graph convolution, this modification introduces a more adaptable and nuanced approach to feature learning within graph convolution networks. It more accurately depicts the process of fitting the point cloud with noise to the underlying surface by and the learning process for MST-GConv acts like a changing system and is managed through a type of neural network known as neural Partial Differential Equations (PDEs). This means it can adapt and improve over time. GD-GCN approximates the Riemannian metric, calculating distances between points along a low-dimensional manifold. This capability allows it to understand the local geometric structure and effectively capture diverse relationships between points from different geometric regions through geometric graph construction based on Riemannian distances. Additionally, GD-GCN incorporates robust graph spectral filters based on the Bernstein polynomial approximation, which modulate eigenvalues for complex and arbitrary spectral responses, providing theoretical guarantees for BIBO stability. Symmetric channel mixing matrices further enhance filter flexibility by enabling channel-level scaling and shifting in the spectral domain.

CVOct 29, 2025
Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation

Yuyang Huang, Yabo Chen, Junyu Zhou et al.

Source-free domain adaptation (SFDA) is a challenging task that tackles domain shifts using only a pre-trained source model and unlabeled target data. Existing SFDA methods are restricted by the fundamental limitation of source-target domain discrepancy. Non-generation SFDA methods suffer from unreliable pseudo-labels in challenging scenarios with large domain discrepancies, while generation-based SFDA methods are evidently degraded due to enlarged domain discrepancies in creating pseudo-source data. To address this limitation, we propose a novel generation-based framework named Diffusion-Driven Progressive Target Manipulation (DPTM) that leverages unlabeled target data as references to reliably generate and progressively refine a pseudo-target domain for SFDA. Specifically, we divide the target samples into a trust set and a non-trust set based on the reliability of pseudo-labels to sufficiently and reliably exploit their information. For samples from the non-trust set, we develop a manipulation strategy to semantically transform them into the newly assigned categories, while simultaneously maintaining them in the target distribution via a latent diffusion model. Furthermore, we design a progressive refinement mechanism that progressively reduces the domain discrepancy between the pseudo-target domain and the real target domain via iterative refinement. Experimental results demonstrate that DPTM outperforms existing methods by a large margin and achieves state-of-the-art performance on four prevailing SFDA benchmark datasets with different scales. Remarkably, DPTM can significantly enhance the performance by up to 18.6% in scenarios with large source-target gaps.

CVOct 23, 2025
GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs

Guanghao Zheng, Bowen Shi, Mingxing Xu et al.

Vision encoders are indispensable for allowing impressive performance of Multi-modal Large Language Models (MLLMs) in vision language tasks such as visual question answering and reasoning. However, existing vision encoders focus on global image representations but overlook fine-grained regional analysis. They are limited in fine grained perception due to the scarcity of fine grained annotated data and the lack of a fine grained pre-training paradigm. In this paper, we propose GranViT, a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to Large Language Models (LLMs) via region level autoregressive training. We first construct Gran-29M, a dataset comprising 2million natural and OCR images paired with over 180 million high-quality region-level annotations, to enable large scale fine grained pretraining. Consequently, we develop a pretraining-adaptation framework along with a self distillation mechanism to train fine-grained GranViT on Gran-29M. We sufficiently exploit the fine-grained annotations from Gran-29M to resort to bounding-box-to-caption regression to enhance localized visual representation of the vision encoder in the pretraining and caption-to-bounding-box regression to improve vision feature utilization and localization for LLM in the adaptation. We further incorporate a self distillation mechanism that imposes explicit localization constraints on the vision encoder to strengthen its regional reasoning capability. Extensive experiments show that GranViT surpasses existing vision encoders and attains strong transferability to varying LLMs. Remarkably, it achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding.

LGAug 3, 2021
Graph Neural Networks With Lifting-based Adaptive Graph Wavelets

Mingxing Xu, Wenrui Dai, Chenglin Li et al.

Spectral-based graph neural networks (SGNNs) have been attracting increasing attention in graph representation learning. However, existing SGNNs are limited in implementing graph filters with rigid transforms (e.g., graph Fourier or predefined graph wavelet transforms) and cannot adapt to signals residing on graphs and tasks at hand. In this paper, we propose a novel class of graph neural networks that realizes graph filters with adaptive graph wavelets. Specifically, the adaptive graph wavelets are learned with neural network-parameterized lifting structures, where structure-aware attention-based lifting operations (i.e., prediction and update operations) are developed to jointly consider graph structures and node features. We propose to lift based on diffusion wavelets to alleviate the structural information loss induced by partitioning non-bipartite graphs. By design, the locality and sparsity of the resulting wavelet transform as well as the scalability of the lifting structure are guaranteed. We further derive a soft-thresholding filtering operation by learning sparse graph representations in terms of the learned wavelets, yielding a localized, efficient, and scalable wavelet-based graph filters. To ensure that the learned graph representations are invariant to node permutations, a layer is employed at the input of the networks to reorder the nodes according to their local topology information. We evaluate the proposed networks in both node-level and graph-level representation learning tasks on benchmark citation and bioinformatics graph datasets. Extensive experiments demonstrate the superiority of the proposed networks over existing SGNNs in terms of accuracy, efficiency, and scalability.

LGJun 18, 2021
Message Passing in Graph Convolution Networks via Adaptive Filter Banks

Xing Gao, Wenrui Dai, Chenglin Li et al.

Graph convolution networks, like message passing graph convolution networks (MPGCNs), have been a powerful tool in representation learning of networked data. However, when data is heterogeneous, most architectures are limited as they employ a single strategy to handle multi-channel graph signals and they typically focus on low-frequency information. In this paper, we present a novel graph convolution operator, termed BankGCN, which keeps benefits of message passing models, but extends their capabilities beyond `low-pass' features. It decomposes multi-channel signals on graphs into subspaces and handles particular information in each subspace with an adapted filter. The filters of all subspaces have different frequency responses and together form a filter bank. Furthermore, each filter in the spectral domain corresponds to a message passing scheme, and diverse schemes are implemented via the filter bank. Importantly, the filter bank and the signal decomposition are jointly learned to adapt to the spectral characteristics of data and to target applications. Furthermore, this is implemented almost without extra parameters in comparison with most existing MPGCNs. Experimental results show that the proposed convolution operator permits to achieve excellent performance in graph classification on a collection of benchmark graph datasets.

CVJun 8, 2021
Multi-dataset Pretraining: A Unified Model for Semantic Segmentation

Bowen Shi, Xiaopeng Zhang, Haohang Xu et al.

Collecting annotated data for semantic segmentation is time-consuming and hard to scale up. In this paper, we for the first time propose a unified framework, termed as Multi-Dataset Pretraining, to take full advantage of the fragmented annotations of different datasets. The highlight is that the annotations from different domains can be efficiently reused and consistently boost performance for each specific domain. This is achieved by first pretraining the network via the proposed pixel-to-prototype contrastive loss over multiple datasets regardless of their taxonomy labels, and followed by fine-tuning the pretrained model over specific dataset as usual. In order to better model the relationship among images and classes from different datasets, we extend the pixel level embeddings via cross dataset mixing and propose a pixel-to-class sparse coding strategy that explicitly models the pixel-class similarity over the manifold embedding space. In this way, we are able to increase intra-class compactness and inter-class separability, as well as considering inter-class similarity across different datasets for better transferability. Experiments conducted on several benchmarks demonstrate its superior performance. Notably, MDP consistently outperforms the pretrained models over ImageNet by a considerable margin, while only using less than 10% samples for pretraining.

LGDec 7, 2020
NCGNN: Node-Level Capsule Graph Neural Network for Semisupervised Classification

Rui Yang, Wenrui Dai, Chenglin Li et al.

Message passing has evolved as an effective tool for designing Graph Neural Networks (GNNs). However, most existing methods for message passing simply sum or average all the neighboring features to update node representations. They are restricted by two problems, i.e., (i) lack of interpretability to identify node features significant to the prediction of GNNs, and (ii) feature over-mixing that leads to the over-smoothing issue in capturing long-range dependencies and inability to handle graphs under heterophily or low homophily. In this paper, we propose a Node-level Capsule Graph Neural Network (NCGNN) to address these problems with an improved message passing scheme. Specifically, NCGNN represents nodes as groups of node-level capsules, in which each capsule extracts distinctive features of its corresponding node. For each node-level capsule, a novel dynamic routing procedure is developed to adaptively select appropriate capsules for aggregation from a subgraph identified by the designed graph filter. NCGNN aggregates only the advantageous capsules and restrains irrelevant messages to avoid over-mixing features of interacting nodes. Therefore, it can relieve the over-smoothing issue and learn effective node representations over graphs with homophily or heterophily. Furthermore, our proposed message passing scheme is inherently interpretable and exempt from complex post-hoc explanations, as the graph filter and the dynamic routing procedure identify a subset of node features that are most significant to the model prediction from the extracted subgraph. Extensive experiments on synthetic as well as real-world graphs demonstrate that NCGNN can well address the over-smoothing issue and produce better node representations for semisupervised node classification. It outperforms the state of the arts under both homophily and heterophily.

IVDec 25, 2019
Partition-Aware Adaptive Switching Neural Networks for Post-Processing in HEVC

Weiyao Lin, Xiaoyi He, Xintong Han et al.

This paper addresses neural network based post-processing for the state-of-the-art video coding standard, High Efficiency Video Coding (HEVC). We first propose a partition-aware Convolution Neural Network (CNN) that utilizes the partition information produced by the encoder to assist in the post-processing. In contrast to existing CNN-based approaches, which only take the decoded frame as input, the proposed approach considers the coding unit (CU) size information and combines it with the distorted decoded frame such that the artifacts introduced by HEVC are efficiently reduced. We further introduce an adaptive-switching neural network (ASN) that consists of multiple independent CNNs to adaptively handle the variations in content and distortion within compressed-video frames, providing further reduction in visual artifacts. Additionally, an iterative training procedure is proposed to train these independent CNNs attentively on different local patch-wise classes. Experiments on benchmark sequences demonstrate the effectiveness of our partition-aware and adaptive-switching neural networks. The source code can be found at http://min.sjtu.edu.cn/lwydemo/HEVCpostprocessing.html.

MMJun 20, 2019
Probabilistic Tile Visibility-Based Server-Side Rate Adaptation for Adaptive 360-Degree Video Streaming

Junni Zou, Chenglin Li, Chengming Liu et al.

In this paper, we study the server-side rate adaptation problem for streaming tile-based adaptive 360-degree videos to multiple users who are competing for transmission resources at the network bottleneck. Specifically, we develop a convolutional neural network (CNN)-based viewpoint prediction model to capture the nonlinear relationship between the future and historical viewpoints. A Laplace distribution model is utilized to characterize the probability distribution of the prediction error. Given the predicted viewpoint, we then map the viewport in the spherical space into its corresponding planar projection in the 2-D plane, and further derive the visibility probability of each tile based on the planar projection and the prediction error probability. According to the visibility probability, tiles are classified as viewport, marginal and invisible tiles. The server-side tile rate allocation problem for multiple users is then formulated as a non-linear discrete optimization problem to minimize the overall received video distortion of all users and the quality difference between the viewport and marginal tiles of each user, subject to the transmission capacity constraints and users' specific viewport requirements. We develop a steepest descent algorithm to solve this non-linear discrete optimization problem, by initializing the feasible starting point in accordance with the optimal solution of its continuous relaxation. Extensive experimental results show that the proposed algorithm can achieve a near-optimal solution, and outperforms the existing rate adaptation schemes for tile-based adaptive 360-video streaming.

CVMay 17, 2019
Group Re-Identification with Multi-grained Matching and Integration

Weiyao Lin, Yuxi Li, Hao Xiao et al.

The task of re-identifying groups of people underdifferent camera views is an important yet less-studied problem.Group re-identification (Re-ID) is a very challenging task sinceit is not only adversely affected by common issues in traditionalsingle object Re-ID problems such as viewpoint and human posevariations, but it also suffers from changes in group layout andgroup membership. In this paper, we propose a novel conceptof group granularity by characterizing a group image by multi-grained objects: individual persons and sub-groups of two andthree people within a group. To achieve robust group Re-ID,we first introduce multi-grained representations which can beextracted via the development of two separate schemes, i.e. onewith hand-crafted descriptors and another with deep neuralnetworks. The proposed representation seeks to characterize bothappearance and spatial relations of multi-grained objects, and isfurther equipped with importance weights which capture varia-tions in intra-group dynamics. Optimal group-wise matching isfacilitated by a multi-order matching process which in turn,dynamically updates the importance weights in iterative fashion.We evaluated on three multi-camera group datasets containingcomplex scenarios and large dynamics, with experimental resultsdemonstrating the effectiveness of our approach. The published dataset can be found in \url{http://min.sjtu.edu.cn/lwydemo/GroupReID.html}

MMFeb 2, 2019
Multiuser Video Streaming Rate Adaptation: A Physical Layer Resource-Aware Deep Reinforcement Learning Approach

Kexin Tang, Nuowen Kan, Junni Zou et al.

We consider a multi-user video streaming service optimization problem over a time-varying and mutually interfering multi-cell wireless network. The key research challenge is to appropriately adapt each user's video streaming rate according to the radio frequency environment (e.g., channel fading and interference level) and service demands (e.g., play request), so that the users' long-term experience for watching videos can be optimized. To address the above challenge, we propose a novel two-level cross-layer optimization framework for multiuser adaptive video streaming over wireless networks. The key idea is to jointly design the physical layer optimization-based beamforming scheme (performed at the base stations) and the application layer Deep Reinforcement Learning (DRL)-based scheme (performed at the user terminals), so that a highly complex multi-user, cross-layer, time-varying video streaming problem can be decomposed into relatively simple problems and solved effectively. Our strategy represents a significant departure for the existing schemes where either short-term user experience optimization is considered, or only single-user point-to-point long-term optimization is considered. Extensive simulations based on real-data sets show that the proposed cross-layer design is effective and promising.