Guojun Qi

CV
h-index31
16papers
619citations
Novelty57%
AI Score42

16 Papers

CVMar 26, 2023
OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering

Zhiyuan Ma, Xiangyu Zhu, Guojun Qi et al.

Controllability, generalizability and efficiency are the major objectives of constructing face avatars represented by neural implicit field. However, existing methods have not managed to accommodate the three requirements simultaneously. They either focus on static portraits, restricting the representation ability to a specific subject, or suffer from substantial computational cost, limiting their flexibility. In this paper, we propose One-shot Talking face Avatar (OTAvatar), which constructs face avatars by a generalized controllable tri-plane rendering solution so that each personalized avatar can be constructed from only one portrait as the reference. Specifically, OTAvatar first inverts a portrait image to a motion-free identity code. Second, the identity code and a motion code are utilized to modulate an efficient CNN to generate a tri-plane formulated volume, which encodes the subject in the desired motion. Finally, volume rendering is employed to generate an image in any view. The core of our solution is a novel decoupling-by-inverting strategy that disentangles identity and motion in the latent code via optimization-based inversion. Benefiting from the efficient tri-plane representation, we achieve controllable rendering of generalized face avatar at $35$ FPS on A100. Experiments show promising performance of cross-identity reenactment on subjects out of the training set and better 3D consistency.

CLMay 18, 2025Code
SLOT: Sample-specific Language Model Optimization at Test-time

Yang Hu, Xingyu Zhang, Xueji Fang et al.

We propose SLOT (Sample-specific Language Model Optimization at Test-time), a novel and parameter-efficient test-time inference approach that enhances a language model's ability to more accurately respond to individual prompts. Existing Large Language Models (LLMs) often struggle with complex instructions, leading to poor performances on those not well represented among general samples. To address this, SLOT conducts few optimization steps at test-time to update a light-weight sample-specific parameter vector. It is added to the final hidden layer before the output head, and enables efficient adaptation by caching the last layer features during per-sample optimization. By minimizing the cross-entropy loss on the input prompt only, SLOT helps the model better aligned with and follow each given instruction. In experiments, we demonstrate that our method outperforms the compared models across multiple benchmarks and LLMs. For example, Qwen2.5-7B with SLOT achieves an accuracy gain of 8.6% on GSM8K from 57.54% to 66.19%, while DeepSeek-R1-Distill-Llama-70B with SLOT achieves a SOTA accuracy of 68.69% on GPQA among 70B-level models. Our code is available at https://github.com/maple-research-lab/SLOT.

CVMar 5, 2021Code
Causal Attention for Vision-Language Tasks

Xu Yang, Hanwang Zhang, Guojun Qi et al.

We present a novel attention mechanism: Causal Attention (CATT), to remove the ever-elusive confounding effect in existing attention-based vision-language models. This effect causes harmful bias that misleads the attention module to focus on the spurious correlations in training data, damaging the model generalization. As the confounder is unobserved in general, we use the front-door adjustment to realize the causal intervention, which does not require any knowledge on the confounder. Specifically, CATT is implemented as a combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention (CS-ATT), where the latter forcibly brings other samples into every IS-ATT, mimicking the causal intervention. CATT abides by the Q-K-V convention and hence can replace any attention module such as top-down attention and self-attention in Transformers. CATT improves various popular attention-based vision-language models by considerable margins. In particular, we show that CATT has great potential in large-scale pre-training, e.g., it can promote the lighter LXMERT~\cite{tan2019lxmert}, which uses fewer data and less computational power, comparable to the heavier UNITER~\cite{chen2020uniter}. Code is published in \url{https://github.com/yangxuntu/catt}.

CVFeb 8, 2024
DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer

Zhiyuan Ma, Xiangyu Zhu, Guojun Qi et al.

Speech-driven 3D facial animation is important for many multimedia applications. Recent work has shown promise in using either Diffusion models or Transformer architectures for this task. However, their mere aggregation does not lead to improved performance. We suspect this is due to a shortage of paired audio-4D data, which is crucial for the Transformer to effectively perform as a denoiser within the Diffusion framework. To tackle this issue, we present DiffSpeaker, a Transformer-based network equipped with novel biased conditional attention modules. These modules serve as substitutes for the traditional self/cross-attention in standard Transformers, incorporating thoughtfully designed biases that steer the attention mechanisms to concentrate on both the relevant task-specific and diffusion-related conditions. We also explore the trade-off between accurate lip synchronization and non-verbal facial expressions within the Diffusion paradigm. Experiments show our model not only achieves state-of-the-art performance on existing benchmarks, but also fast inference speed owing to its ability to generate facial motions in parallel.

CVDec 12, 2023
Lightweight high-resolution Subject Matting in the Real World

Peng Liu, Fanyi Wang, Jingwen Su et al.

Existing saliency object detection (SOD) methods struggle to satisfy fast inference and accurate results simultaneously in high resolution scenes. They are limited by the quality of public datasets and efficient network modules for high-resolution images. To alleviate these issues, we propose to construct a saliency object matting dataset HRSOM and a lightweight network PSUNet. Considering efficient inference of mobile depolyment framework, we design a symmetric pixel shuffle module and a lightweight module TRSU. Compared to 13 SOD methods, the proposed PSUNet has the best objective performance on the high-resolution benchmark dataset. Evaluation results of objective assessment are superior compared to U$^2$Net that has 10 times of parameter amount of our network. On Snapdragon 8 Gen 2 Mobile Platform, inference a single 640$\times$640 image only takes 113ms. And on the subjective assessment, evaluation results are better than the industry benchmark IOS16 (Lift subject from background).

CVMar 15, 2025
Learning Dual-Domain Multi-Scale Representations for Single Image Deraining

Shun Zou, Yi Zou, Mingya Zhang et al.

Existing image deraining methods typically rely on single-input, single-output, and single-scale architectures, which overlook the joint multi-scale information between external and internal features. Furthermore, single-domain representations are often too restrictive, limiting their ability to handle the complexities of real-world rain scenarios. To address these challenges, we propose a novel Dual-Domain Multi-Scale Representation Network (DMSR). The key idea is to exploit joint multi-scale representations from both external and internal domains in parallel while leveraging the strengths of both spatial and frequency domains to capture more comprehensive properties. Specifically, our method consists of two main components: the Multi-Scale Progressive Spatial Refinement Module (MPSRM) and the Frequency Domain Scale Mixer (FDSM). The MPSRM enables the interaction and coupling of multi-scale expert information within the internal domain using a hierarchical modulation and fusion strategy. The FDSM extracts multi-scale local information in the spatial domain, while also modeling global dependencies in the frequency domain. Extensive experiments show that our model achieves state-of-the-art performance across six benchmark datasets.

CVOct 12, 2025
When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance

Jinjin Cao, Zhiyang Chen, Zijun Wang et al.

Vision-Language Models (VLMs) have shown solid ability for multimodal understanding of both visual and language contexts. However, existing VLMs often face severe challenges of hallucinations, meaning that VLMs tend to generate responses that are only fluent in the language but irrelevant to images in previous contexts. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance(CMG), a training-free decoding method that addresses the hallucinations by leveraging the difference between the output distributions of the original model and the one with degraded visual-language attention. In practice, we adaptively mask the attention weight of the most influential image tokens in selected transformer layers to corrupt the visual-language perception as a concrete type of degradation. Such a degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the ability of VLMs. In experiment sections, we conduct comprehensive studies. All results demonstrate the superior advantages of CMG with neither additional conditions nor training costs. We also quantitatively show CMG can improve different VLM's performance on hallucination-specific benchmarks and generalize effectively.

CVApr 23, 2025
Cross Paradigm Representation and Alignment Transformer for Image Deraining

Shun Zou, Yi Zou, Juncheng Li et al.

Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge single-paradigm architectures, necessitating a unified framework to integrate complementary global-local and spatial-channel representations. To address this, we propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer). Its core idea is the hierarchical representation and alignment, leveraging the strengths of both paradigms (spatial-channel and global-local) to aid image reconstruction. It bridges the gap within and between paradigms, aligning and coordinating them to enable deep interaction and fusion of features. Specifically, we use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA). SPC-SA enhances global channel dependencies through dynamic sparsity, while SPR-SA focuses on spatial rain distribution and fine-grained texture recovery. To address the feature misalignment and knowledge differences between them, we introduce the Adaptive Alignment Frequency Module (AAFM), which aligns and interacts with features in a two-stage progressive manner, enabling adaptive guidance and complementarity. This reduces the information gap within and between paradigms. Through this unified cross-paradigm dynamic interaction framework, we achieve the extraction of the most valuable interactive fusion information from the two paradigms. Extensive experiments demonstrate that our model achieves state-of-the-art performance on eight benchmark datasets and further validates CPRAformer's robustness in other image restoration tasks and downstream applications.

CVJan 20, 2024
UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures

Mingyuan Zhou, Rakib Hyder, Ziwei Xuan et al.

Recent advances in 3D avatar generation have gained significant attentions. These breakthroughs aim to produce more realistic animatable avatars, narrowing the gap between virtual and real-world experiences. Most of existing works employ Score Distillation Sampling (SDS) loss, combined with a differentiable renderer and text condition, to guide a diffusion model in generating 3D avatars. However, SDS often generates oversmoothed results with few facial details, thereby lacking the diversity compared with ancestral sampling. On the other hand, other works generate 3D avatar from a single image, where the challenges of unwanted lighting effects, perspective views, and inferior image quality make them difficult to reliably reconstruct the 3D face meshes with the aligned complete textures. In this paper, we propose a novel 3D avatar generation approach termed UltrAvatar with enhanced fidelity of geometry, and superior quality of physically based rendering (PBR) textures without unwanted lighting. To this end, the proposed approach presents a diffuse color extraction model and an authenticity guided texture diffusion model. The former removes the unwanted lighting effects to reveal true diffuse colors so that the generated avatars can be rendered under various lighting conditions. The latter follows two gradient-based guidances for generating PBR textures to render diverse face-identity features and details better aligning with 3D mesh geometry. We demonstrate the effectiveness and robustness of the proposed method, outperforming the state-of-the-art methods by a large margin in the experiments.

CVMay 2, 2023
LatentAvatar: Learning Latent Expression Code for Expressive Neural Head Avatar

Yuelang Xu, Hongwen Zhang, Lizhen Wang et al.

Existing approaches to animatable NeRF-based head avatars are either built upon face templates or use the expression coefficients of templates as the driving signal. Despite the promising progress, their performances are heavily bound by the expression power and the tracking accuracy of the templates. In this work, we present LatentAvatar, an expressive neural head avatar driven by latent expression codes. Such latent expression codes are learned in an end-to-end and self-supervised manner without templates, enabling our method to get rid of expression and tracking issues. To achieve this, we leverage a latent head NeRF to learn the person-specific latent expression codes from a monocular portrait video, and further design a Y-shaped network to learn the shared latent expression codes of different subjects for cross-identity reenactment. By optimizing the photometric reconstruction objectives in NeRF, the latent expression codes are learned to be 3D-aware while faithfully capturing the high-frequency detailed expressions. Moreover, by learning a mapping between the latent expression code learned in shared and person-specific settings, LatentAvatar is able to perform expressive reenactment between different subjects. Experimental results show that our LatentAvatar is able to capture challenging expressions and the subtle movement of teeth and even eyeballs, which outperforms previous state-of-the-art solutions in both quantitative and qualitative comparisons. Project page: https://www.liuyebin.com/latentavatar.

CVDec 29, 2019
FLAT: Few-Shot Learning via Autoencoding Transformation Regularizers

Haohang Xu, Hongkai Xiong, Guojun Qi

One of the most significant challenges facing a few-shot learning task is the generalizability of the (meta-)model from the base to the novel categories. Most of existing few-shot learning models attempt to address this challenge by either learning the meta-knowledge from multiple simulated tasks on the base categories, or resorting to data augmentation by applying various transformations to training examples. However, the supervised nature of model training in these approaches limits their ability of exploring variations across different categories, thus restricting their cross-category generalizability in modeling novel concepts. To this end, we present a novel regularization mechanism by learning the change of feature representations induced by a distribution of transformations without using the labels of data examples. We expect this regularizer could expand the semantic space of base categories to cover that of novel categories through the transformation of feature representations. It could minimize the risk of overfitting into base categories by inspecting the transformation-augmented variations at the encoded feature level. This results in the proposed FLAT (Few-shot Learning via Autoencoding Transformations) approach by autoencoding the applied transformations. The experiment results show the superior performances to the current state-of-the-art methods in literature.

CVOct 25, 2019
An End-to-End Foreground-Aware Network for Person Re-Identification

Yiheng Liu, Wengang Zhou, Jianzhuang Liu et al.

Person re-identification is a crucial task of identifying pedestrians of interest across multiple surveillance camera views. In person re-identification, a pedestrian is usually represented with features extracted from a rectangular image region that inevitably contains the scene background, which incurs ambiguity to distinguish different pedestrians and degrades the accuracy. To this end, we propose an end-to-end foreground-aware network to discriminate foreground from background by learning a soft mask for person re-identification. In our method, in addition to the pedestrian ID as supervision for foreground, we introduce the camera ID of each pedestrian image for background modeling. The foreground branch and the background branch are optimized collaboratively. By presenting a target attention loss, the pedestrian features extracted from the foreground branch become more insensitive to the backgrounds, which greatly reduces the negative impacts of changing backgrounds on matching an identical across different camera views. Notably, in contrast to existing methods, our approach does not require any additional dataset to train a human landmark detector or a segmentation model for locating the background regions. The experimental results conducted on three challenging datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, demonstrate the effectiveness of our approach.

CVDec 11, 2018
Prior-Knowledge and Attention-based Meta-Learning for Few-Shot Learning

Yunxiao Qin, Weiguo Zhang, Chenxu Zhao et al.

Recently, meta-learning has been shown as a promising way to solve few-shot learning. In this paper, inspired by the human cognition process which utilizes both prior-knowledge and vision attention in learning new knowledge, we present a novel paradigm of meta-learning approach with three developments to introduce attention mechanism and prior-knowledge for meta-learning. In our approach, prior-knowledge is responsible for helping meta-learner expressing the input data into high-level representation space, and attention mechanism enables meta-learner focusing on key features of the data in the representation space. Compared with existing meta-learning approaches that pay little attention to prior-knowledge and vision attention, our approach alleviates the meta-learner's few-shot cognition burden. Furthermore, a Task-Over-Fitting (TOF) problem, which indicates that the meta-learner has poor generalization on different K-shot learning tasks, is discovered and we propose a Cross-Entropy across Tasks (CET) metric to model and solve the TOF problem. Extensive experiments demonstrate that we improve the meta-learner with state-of-the-art performance on several few-shot learning benchmarks, and at the same time the TOF problem can also be released greatly.

CVNov 13, 2018
Exploiting temporal and depth information for multi-frame face anti-spoofing

Zezheng Wang, Chenxu Zhao, Yunxiao Qin et al.

Face anti-spoofing is significant to the security of face recognition systems. Previous works on depth supervised learning have proved the effectiveness for face anti-spoofing. Nevertheless, they only considered the depth as an auxiliary supervision in the single frame. Different from these methods, we develop a new method to estimate depth information from multiple RGB frames and propose a depth-supervised architecture which can efficiently encodes spatiotemporal information for presentation attack detection. It includes two novel modules: optical flow guided feature block (OFFB) and convolution gated recurrent units (ConvGRU) module, which are designed to extract short-term and long-term motion to discriminate living and spoofing faces. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art results on four benchmark datasets, namely OULU-NPU, SiW, CASIA-MFSD, and Replay-Attack.

CVJun 8, 2018
Large-scale Bisample Learning on ID Versus Spot Face Recognition

Xiangyu Zhu, Hao Liu, Zhen Lei et al.

In real-world face recognition applications, there is a tremendous amount of data with two images for each person. One is an ID photo for face enrollment, and the other is a probe photo captured on spot. Most existing methods are designed for training data with limited breadth (a relatively small number of classes) and sufficient depth (many samples for each class). They would meet great challenges on ID versus Spot (IvS) data, including the under-represented intra-class variations and an excessive demand on computing devices. In this paper, we propose a deep learning based large-scale bisample learning (LBL) method for IvS face recognition. To tackle the bisample problem with only two samples for each class, a classification-verification-classification (CVC) training strategy is proposed to progressively enhance the IvS performance. Besides, a dominant prototype softmax (DP-softmax) is incorporated to make the deep learning scalable on large-scale classes. We conduct LBL on a IvS face dataset with more than two million identities. Experimental results show the proposed method achieves superior performance to previous ones, validating the effectiveness of LBL on IvS face recognition.

LGMar 19, 2015
Rank Subspace Learning for Compact Hash Codes

Kai Li, Guojun Qi, Jun Ye et al.

The era of Big Data has spawned unprecedented interests in developing hashing algorithms for efficient storage and fast nearest neighbor search. Most existing work learn hash functions that are numeric quantizations of feature values in projected feature space. In this work, we propose a novel hash learning framework that encodes feature's rank orders instead of numeric values in a number of optimal low-dimensional ranking subspaces. We formulate the ranking subspace learning problem as the optimization of a piece-wise linear convex-concave function and present two versions of our algorithm: one with independent optimization of each hash bit and the other exploiting a sequential learning framework. Our work is a generalization of the Winner-Take-All (WTA) hash family and naturally enjoys all the numeric stability benefits of rank correlation measures while being optimized to achieve high precision at very short code length. We compare with several state-of-the-art hashing algorithms in both supervised and unsupervised domain, showing superior performance in a number of data sets.