Zhijing Yang

CV
h-index: 26
Papers: 38
Citations: 645
Novelty: 48%
AI Score: 55

38 Papers

CV · May 11, 2022
NTIRE 2022 Challenge on Efficient Super-Resolution: Methods and Results

Yawei Li, Kai Zhang, Radu Timofte et al. · eth-zurich, tencent-ai

This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improvement of efficiency measured according to several metrics including runtime, parameters, FLOPs, activations, and memory consumption while at least maintaining the PSNR of 29.00dB on DIV2K validation set. IMDN is set as the baseline for efficiency measurement. The challenge had 3 tracks including the main track (runtime), sub-track one (model complexity), and sub-track two (overall performance). In the main track, the practical runtime performance of the submissions was evaluated. The ranking of the teams was determined directly by the absolute value of the average runtime on the validation set and test set. In sub-track one, the number of parameters and FLOPs were considered, and the individual rankings of the two metrics were summed up to determine a final ranking in this track. In sub-track two, all of the five metrics mentioned in the description of the challenge including runtime, parameter count, FLOPs, activations, and memory consumption were considered. Similar to sub-track one, the rankings of the five metrics were summed up to determine a final ranking. The challenge had 303 registered participants, and 43 teams made valid submissions. They gauge the state-of-the-art in efficient single image super-resolution.
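
The rank-sum rule described for the two sub-tracks can be illustrated with a small sketch. The team names and metric values are invented for illustration, every metric is assumed to be lower-is-better, and tie handling is an assumption because the abstract does not specify it.

```python
def rank_sum(scores, metrics):
    """Rank teams per metric (1 = best, lower value is better), then sum the ranks.

    scores: {team: {metric: value}}. Ties are broken by sort order here,
    which is an assumption and not something the challenge report states.
    """
    totals = {team: 0 for team in scores}
    for metric in metrics:
        ordered = sorted(scores, key=lambda team: scores[team][metric])
        for rank, team in enumerate(ordered, start=1):
            totals[team] += rank
    # Smallest rank sum comes first in the final ranking.
    return sorted(totals.items(), key=lambda item: item[1])


# Hypothetical submissions (parameters in thousands, FLOPs in G).
scores = {
    "team_a": {"params": 300, "flops": 20.0},
    "team_b": {"params": 250, "flops": 30.0},
    "team_c": {"params": 400, "flops": 25.0},
}
final_ranking = rank_sum(scores, ["params", "flops"])
```

Passing all five metric names instead of two would give the sub-track two variant of the same rule.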

CV · May 26, 2022
Dual-Perspective Semantic-Aware Representation Blending for Multi-Label Image Recognition with Partial Labels

Tao Pu, Tianshui Chen, Hefeng Wu et al.

Despite achieving impressive progress, current multi-label image recognition (MLR) algorithms heavily depend on large-scale datasets with complete labels, making collecting large-scale datasets extremely time-consuming and labor-intensive. Training the multi-label image recognition models with partial labels (MLR-PL) is an alternative way, in which merely some labels are known while others are unknown for each image. However, current MLR-PL algorithms rely on pre-trained image similarity models or iteratively updating the image classification models to generate pseudo labels for the unknown labels. Thus, they depend on a certain amount of annotations and inevitably suffer from obvious performance drops, especially when the known label proportion is low. To address this dilemma, we propose a dual-perspective semantic-aware representation blending (DSRB) that blends multi-granularity category-specific semantic representation across different images, from the instance and prototype perspectives respectively, to transfer information of known labels to complement unknown labels. Specifically, an instance-perspective representation blending (IPRB) module is designed to blend the representations of the known labels in an image with the representations of the corresponding unknown labels in another image to complement these unknown labels. Meanwhile, a prototype-perspective representation blending (PPRB) module is introduced to learn more stable representation prototypes for each category and blend the representation of unknown labels with the prototypes of corresponding labels, in a location-sensitive manner, to complement these unknown labels. Extensive experiments on the MS-COCO, Visual Genome, and Pascal VOC 2007 datasets show that the proposed DSRB consistently outperforms current state-of-the-art algorithms on all known label proportion settings.

CV · Jan 3, 2023
OccluMix: Towards De-Occlusion Virtual Try-on by Semantically-Guided Mixup

Zhijing Yang, Junyang Chen, Yukai Shi et al.

Image virtual try-on aims at replacing the clothes on a person image with a garment image (in-shop clothes), which has attracted increasing attention from the multimedia and computer vision communities. Prior methods successfully preserve the character of clothing images; however, occlusion remains a pernicious effect for realistic virtual try-on. In this work, we first present a comprehensive analysis of the occlusions and categorize them into two aspects: i) Inherent-Occlusion: the ghost of the former cloth still exists in the try-on image; ii) Acquired-Occlusion: the target cloth warps to the unreasonable body part. Based on the in-depth analysis, we find that the occlusions can be simulated by a novel semantically-guided mixup module, which can generate semantic-specific occluded images that work together with the try-on images to facilitate training a de-occlusion try-on (DOC-VTON) framework. Specifically, DOC-VTON first conducts a sharpened semantic parsing on the try-on person. Aided by semantics guidance and pose prior, textures of various complexities are selectively blended with human parts in a copy-and-paste manner. Then, the Generative Module (GM) is utilized to synthesize the final try-on image and to learn de-occlusion jointly. In comparison to the state-of-the-art methods, DOC-VTON achieves better perceptual quality by reducing occlusion effects.

CV · Jun 6, 2022
Real-World Image Super-Resolution by Exclusionary Dual-Learning

Hao Li, Jinghui Qin, Zhijing Yang et al.

Real-world image super-resolution, a practical image restoration problem that aims to obtain high-quality images from in-the-wild input, has recently received considerable attention owing to its tremendous application potential. Although deep learning-based methods have achieved promising restoration quality on real-world image super-resolution datasets, they ignore the relationship between L1- and perceptual-minimization and roughly adopt auxiliary large-scale datasets for pre-training. In this paper, we discuss the image types within a corrupted image and the property of perceptual- and Euclidean-based evaluation protocols. Then we propose a method, Real-World image Super-Resolution by Exclusionary Dual-Learning (RWSR-EDL), to address the feature diversity in perceptual- and L1-based cooperative learning. Moreover, a noise-guidance data collection strategy is developed to address the training time consumption in multiple datasets optimization. When an auxiliary dataset is incorporated, RWSR-EDL achieves promising results and avoids any increase in training time by adopting the noise-guidance data collection strategy. Extensive experiments show that RWSR-EDL achieves competitive performance over state-of-the-art methods on four in-the-wild image super-resolution datasets.

CV · Jul 26, 2022
Criteria Comparative Learning for Real-scene Image Super-Resolution

Yukai Shi, Hao Li, Sen Zhang et al.

Real-scene image super-resolution aims to restore real-world low-resolution images into their high-quality versions. A typical RealSR framework usually includes the optimization of multiple criteria which are designed for different image properties, by making the implicit assumption that the ground-truth images can provide a good trade-off between different criteria. However, this assumption could be easily violated in practice due to the inherent contrastive relationship between different image properties. Contrastive learning (CL) provides a promising recipe to relieve this problem by learning discriminative features using the triplet contrastive losses. Though CL has achieved significant success in many computer vision tasks, it is non-trivial to introduce CL to RealSR due to the difficulty in defining valid positive image pairs in this case. Inspired by the observation that the contrastive relationship could also exist between the criteria, in this work, we propose a novel training paradigm for RealSR, named Criteria Comparative Learning (Cria-CL), by developing contrastive losses defined on criteria instead of image patches. In addition, a spatial projector is proposed to obtain a good view for Cria-CL in RealSR. Our experiments demonstrate that compared with the typical weighted regression strategy, our method achieves a significant improvement under similar parameter settings.

CV · Jul 28, 2022
DnSwin: Toward Real-World Denoising via Continuous Wavelet Sliding-Transformer

Hao Li, Zhijing Yang, Xiaobin Hong et al.

Real-world image denoising is a practical image restoration problem that aims to obtain clean images from in-the-wild noisy inputs. Recently, the Vision Transformer (ViT) has exhibited a strong ability to capture long-range dependencies, and many researchers have attempted to apply the ViT to image denoising tasks. However, a real-world image is an isolated frame that makes the ViT build long-range dependencies based on the internal patches, which divides images into patches, disarranges noise patterns and damages gradient continuity. In this article, we propose to resolve this issue by using a continuous Wavelet Sliding-Transformer that builds frequency correspondences under real-world scenes, called DnSwin. Specifically, we first extract the bottom features from noisy input images by using a convolutional neural network (CNN) encoder. The key to DnSwin is to extract high-frequency and low-frequency information from the observed features and build frequency dependencies. To this end, we propose a Wavelet Sliding-Window Transformer (WSWT) that utilizes the discrete wavelet transform (DWT), self-attention and the inverse DWT (IDWT) to extract deep features. Finally, we reconstruct the deep features into denoised images using a CNN decoder. Both quantitative and qualitative evaluations conducted on real-world denoising benchmarks demonstrate that the proposed DnSwin performs favorably against the state-of-the-art methods.
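
The DWT and IDWT steps that bracket the attention stage can be illustrated with a minimal one-level Haar transform. This is a 1D simplification written for illustration only: the choice of the Haar wavelet is an assumption, the self-attention step is omitted, and the function names are hypothetical.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)


def haar_dwt(signal):
    """One-level 1D Haar DWT; the signal length must be even."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    low = [(a + b) * SCALE for a, b in pairs]   # low-frequency (average) band
    high = [(a - b) * SCALE for a, b in pairs]  # high-frequency (detail) band
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) * SCALE, (l - h) * SCALE])
    return out


noisy_row = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0]
low, high = haar_dwt(noisy_row)
# A real model would process `low` and `high` here (for example with attention).
restored = haar_idwt(low, high)
```

Because the transform is invertible, any change in the output relative to the input comes from the processing applied between the two calls, not from the decomposition itself.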

CV · May 23, 2022
Heterogeneous Semantic Transfer for Multi-label Recognition with Partial Labels

Tianshui Chen, Tao Pu, Lingbo Liu et al.

Multi-label image recognition with partial labels (MLR-PL), in which some labels are known while others are unknown for each image, may greatly reduce the cost of annotation and thus facilitate large-scale MLR. We find that strong semantic correlations exist within each image and across different images, and these correlations can help transfer the knowledge possessed by the known labels to retrieve the unknown labels and thus improve the performance of the MLR-PL task (see Figure 1). In this work, we propose a novel heterogeneous semantic transfer (HST) framework that consists of two complementary transfer modules that explore both within-image and cross-image semantic correlations to transfer the knowledge possessed by known labels to generate pseudo labels for the unknown labels. Specifically, an intra-image semantic transfer (IST) module learns an image-specific label co-occurrence matrix for each image and maps the known labels to complement the unknown labels based on these matrices. Additionally, a cross-image transfer (CST) module learns category-specific feature-prototype similarities and then helps complement the unknown labels that have high degrees of similarity with the corresponding prototypes. Finally, both the known and generated pseudo labels are used to train MLR models. Extensive experiments conducted on the Microsoft COCO, Visual Genome, and Pascal VOC 2007 datasets show that the proposed HST framework achieves superior performance to that of current state-of-the-art algorithms. Specifically, it obtains mean average precision (mAP) improvements of 1.4%, 3.3%, and 0.4% on the three datasets over the results of the best-performing previously developed algorithm.

CV · Jul 9, 2024
Dynamic Correlation Learning and Regularization for Multi-Label Confidence Calibration

Tianshui Chen, Weihang Wang, Tao Pu et al.

Modern visual recognition models often display overconfidence due to their reliance on complex deep neural networks and one-hot target supervision, resulting in unreliable confidence scores that necessitate calibration. While current confidence calibration techniques primarily address single-label scenarios, there is a lack of focus on more practical and generalizable multi-label contexts. This paper introduces the Multi-Label Confidence Calibration (MLCC) task, aiming to provide well-calibrated confidence scores in multi-label scenarios. Unlike single-label images, multi-label images contain multiple objects, leading to semantic confusion and further unreliability in confidence scores. Existing single-label calibration methods, based on label smoothing, fail to account for category correlations, which are crucial for addressing semantic confusion, thereby yielding sub-optimal performance. To overcome these limitations, we propose the Dynamic Correlation Learning and Regularization (DCLR) algorithm, which leverages multi-grained semantic correlations to better model semantic confusion for adaptive regularization. DCLR learns dynamic instance-level and prototype-level similarities specific to each category, using these to measure semantic correlations across different categories. With this understanding, we construct adaptive label vectors that assign higher values to categories with strong correlations, thereby facilitating more effective regularization. We establish an evaluation benchmark, re-implementing several advanced confidence calibration algorithms and applying them to leading multi-label recognition (MLR) models for fair comparison. Through extensive experiments, we demonstrate the superior performance of DCLR over existing methods in providing reliable confidence scores in multi-label scenarios.
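
The idea of adaptive label vectors can be contrasted with uniform label smoothing in a short sketch. The construction below is an assumption made for illustration: the correlation matrix is given rather than learned, and the rule used for negatives (scale a smoothing budget by the strongest correlation with any positive category) is hypothetical, not the DCLR formulation.

```python
def adaptive_label_vector(targets, corr, eps=0.1):
    """Build a soft label vector from binary targets.

    targets: list of 0/1 labels for one image.
    corr: square matrix, corr[i][j] in [0, 1] is the correlation of
          category i with category j (assumed given here).
    eps: smoothing budget (hypothetical hyperparameter).
    """
    positives = [i for i, t in enumerate(targets) if t == 1]
    soft = []
    for j, t in enumerate(targets):
        if t == 1:
            soft.append(1.0 - eps)
        else:
            # Negatives that correlate strongly with a present category
            # receive a larger soft value than unrelated negatives.
            strength = max((corr[i][j] for i in positives), default=0.0)
            soft.append(eps * strength)
    return soft


corr = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
soft_labels = adaptive_label_vector([1, 0, 0], corr)
```

Uniform label smoothing would give both negatives the same value; here the second category, which correlates strongly with the present one, gets a larger target than the third.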

CV · Mar 20, 2023
Open-World Pose Transfer via Sequential Test-Time Adaption

Junyang Chen, Xiaoyu Xian, Zhijing Yang et al.

Pose transfer, which aims to transfer a given person into a specified posture, has recently attracted considerable attention. A typical pose transfer framework usually employs representative datasets to train a discriminative model, which is often violated by out-of-distribution (OOD) instances. Recently, test-time adaption (TTA) offers a feasible solution for OOD data by using a pre-trained model that learns essential features with self-supervision. However, those methods implicitly make an assumption that all test distributions have a unified signal that can be learned directly. In open-world conditions, the pose transfer task raises various independent signals: OOD appearance and skeleton, which need to be extracted and distributed in speciality. To address this point, we develop a SEquential Test-time Adaption (SETA). In the test-time phase, SETA extracts and distributes external appearance texture by augmenting OOD data for self-supervised training. To make non-Euclidean similarity among different postures explicit, SETA uses the image representations derived from a person re-identification (Re-ID) model for similarity computation. By addressing implicit posture representation in the test-time sequentially, SETA greatly improves the generalization performance of current pose transfer models. In our experiment, we first show that pose transfer can be applied to open-world applications, including TikTok reenactment and celebrity motion synthesis.

LG · Aug 2, 2023
A Transformer-based Prediction Method for Depth of Anesthesia During Target-controlled Infusion of Propofol and Remifentanil

Yongkang He, Siyuan Peng, Mingjin Chen et al.

Accurately predicting anesthetic effects is essential for target-controlled infusion systems. Traditional pharmacokinetic-pharmacodynamic (PK-PD) models for Bispectral index (BIS) prediction require manual selection of model parameters, which can be challenging in clinical settings. Recently proposed deep learning methods can only capture general trends and may not predict abrupt changes in BIS. To address these issues, we propose a transformer-based method for predicting the depth of anesthesia (DOA) using drug infusions of propofol and remifentanil. Our method employs long short-term memory (LSTM) and gated residual network (GRN) components to improve the efficiency of feature fusion and applies an attention mechanism to discover the interactions between the drugs. We also use label distribution smoothing and reweighting losses to address data imbalance. Experimental results show that our proposed method outperforms traditional PK-PD models and previous deep learning methods, effectively predicting anesthetic depth under sudden and deep anesthesia conditions.
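
Label distribution smoothing with reweighting can be sketched as follows: smooth the histogram of target values with a small kernel, then weight each bin by the inverse of its smoothed density. The bin counts, the kernel, and the normalization to mean 1 are illustrative assumptions; the paper's exact settings are not given in the abstract.

```python
def lds_weights(bin_counts, kernel):
    """Return one loss weight per label bin.

    bin_counts: number of training samples whose target falls in each bin.
    kernel: odd-length symmetric smoothing kernel (assumed, e.g. Gaussian-like).
    """
    half = len(kernel) // 2
    n = len(bin_counts)
    smoothed = []
    for i in range(n):
        total = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:  # bins outside the range contribute nothing
                total += w * bin_counts[j]
        smoothed.append(total)
    # Rare bins (low smoothed density) receive larger weights.
    weights = [1.0 / s if s > 0 else 0.0 for s in smoothed]
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]


# Hypothetical BIS histogram: many samples in the common range, few in deep anesthesia.
weights = lds_weights([100, 10, 1], kernel=[0.25, 0.5, 0.25])
```

The resulting weights would multiply the per-sample regression loss, so errors on under-represented BIS ranges count more during training.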

CV · Aug 2, 2023
Data-Centric Diet: Effective Multi-center Dataset Pruning for Medical Image Segmentation

Yongkang He, Mingjin Chen, Zhijing Yang et al.

This paper seeks to address the dense labeling problems where a significant fraction of the dataset can be pruned without sacrificing much accuracy. We observe that, on standard medical image segmentation benchmarks, the loss gradient norm-based metrics of individual training examples applied in image classification fail to identify the important samples. To address this issue, we propose a data pruning method that takes into consideration the training dynamics on target regions using the Dynamic Average Dice (DAD) score. To the best of our knowledge, we are among the first to address data importance in dense labeling tasks in the field of medical image analysis, making the following contributions: (1) investigating the underlying causes with rigorous empirical analysis, and (2) determining an effective data pruning approach in dense labeling problems. Our solution can be used as a strong yet simple baseline to select important examples for medical image segmentation with combined data sources.
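
A per-sample score based on training dynamics can be sketched by averaging the Dice score of one training example over epochs. This reading of "Dynamic Average Dice" is an assumption based only on the name and the abstract, and the sketch deliberately stops at computing the score; which samples to prune is not specified here.

```python
def dice(pred, target):
    """Dice coefficient between two flat binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 if denom == 0 else 2.0 * intersection / denom


def dynamic_average_dice(preds_per_epoch, target):
    """Average the Dice score of one sample across training epochs."""
    scores = [dice(pred, target) for pred in preds_per_epoch]
    return sum(scores) / len(scores)


# Hypothetical predictions for one sample at two epochs (flattened masks).
target = [1, 1, 0, 0]
history = [
    [1, 0, 0, 0],  # early epoch: half of the region found
    [1, 1, 0, 0],  # later epoch: region fully segmented
]
dad_score = dynamic_average_dice(history, target)
```

Sorting the training set by such a score gives an ordering from consistently hard samples to consistently easy ones, which a pruning rule could then threshold.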

CV · Aug 17, 2023
Learning to In-paint: Domain Adaptive Shape Completion for 3D Organ Segmentation

Mingjin Chen, Yongkang He, Yongyi Lu et al.

We aim at incorporating explicit shape information into current 3D organ segmentation models. Different from previous works, we formulate shape learning as an in-painting task, which is named Masked Label Mask Modeling (MLM). Through MLM, learnable mask tokens are fed into transformer blocks to complete the label mask of the organ. To transfer MLM shape knowledge to the target domain, we further propose a novel shape-aware self-distillation with both in-painting reconstruction loss and pseudo loss. Extensive experiments on five public organ segmentation datasets show consistent improvements over prior art with at least 1.2 points gain in the Dice score, demonstrating the effectiveness of our method in challenging unsupervised domain adaptation scenarios including: (1) in-domain organ segmentation; (2) unseen domain segmentation; and (3) unseen organ segmentation. We hope this work will advance shape analysis and geometric learning in medical imaging.

CV · Apr 28
Personalized Cross-Modal Emotional Correlation Learning for Speech-Preserving Facial Expression Manipulation

Tianshui Chen, Yujie Zhu, Jianman Lin et al.

Speech-preserving facial expression manipulation (SPFEM) aims to enhance human expressiveness without altering mouth movements tied to the original speech. A primary challenge in this domain is the scarcity of paired data, namely aligned frames of the same individual with identical speech but different expressions, which impedes direct supervision for emotional manipulation. While current Visual-Language Models (VLMs) can extract aligned visual and semantic features, making them a promising source of supervision, their direct application is limited. To this end, we propose a Personalized Cross-Modal Emotional Correlation Learning (PCMECL) algorithm that refines VLM-based supervision through two major improvements. First, standard VLMs rely on a single generic prompt for each emotion, failing to capture expressive variations among individuals. PCMECL addresses this limitation by conditioning on individual visual information to learn personalized prompts, thereby establishing more fine-grained visual-semantic correlations. Second, even with personalization, inherent discrepancies persist between the visual and semantic feature distributions. To bridge this modality gap, PCMECL employs feature differencing to correlate the modalities, providing more precisely aligned supervision by matching the change in visual features to the change in semantic features. As a plug-and-play module, PCMECL can be seamlessly integrated into existing SPFEM models. Extensive experiments across various datasets demonstrate the superior efficacy of our algorithm.
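
The feature-differencing idea (matching the change in visual features to the change in semantic features) can be sketched as a direction-alignment loss. The cosine form of the loss, the function names, and the toy vectors are assumptions for illustration; the abstract does not state the exact objective.

```python
import math


def subtract(u, v):
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def differencing_loss(vis_source, vis_output, txt_source, txt_target):
    """1 - cos(change in visual features, change in semantic features)."""
    visual_change = subtract(vis_output, vis_source)
    semantic_change = subtract(txt_target, txt_source)
    return 1.0 - cosine(visual_change, semantic_change)


# Toy features: the two modalities live at different offsets, but the
# change from source emotion to target emotion points the same way.
aligned = differencing_loss([0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0])
opposed = differencing_loss([0.0, 0.0], [-1.0, -1.0], [2.0, 2.0], [3.0, 3.0])
```

Comparing differences rather than raw features means a constant offset between the visual and semantic distributions cancels out, which is one way to read the abstract's point about bridging the modality gap.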

CV · Apr 22
Learning Spatial-Temporal Coherent Correlations for Speech-Preserving Facial Expression Manipulation

Tianshui Chen, Jianman Lin, Zhijing Yang et al.

Speech-preserving facial expression manipulation (SPFEM) aims to modify facial emotions while meticulously maintaining the mouth animation associated with spoken content. Current works depend on inaccessible paired training samples for the person, where two aligned frames exhibit the same speech content yet differ in emotional expression, limiting the SPFEM applications in real-world scenarios. In this work, we discover that speakers who convey the same content with different emotions exhibit highly correlated local facial animations in both spatial and temporal spaces, providing valuable supervision for SPFEM. To capitalize on this insight, we propose a novel spatial-temporal coherent correlation learning (STCCL) algorithm, which models the aforementioned correlations as explicit metrics and integrates the metrics to supervise manipulating facial expression and meanwhile better preserving the facial animation of spoken content. To this end, it first learns a spatial coherent correlation metric, ensuring that the visual correlations of adjacent local regions within an image linked to a specific emotion closely resemble those of corresponding regions in an image linked to a different emotion. Simultaneously, it develops a temporal coherent correlation metric, ensuring that the visual correlations of specific regions across adjacent image frames associated with one emotion are similar to those in the corresponding regions of frames associated with another emotion. Recognizing that visual correlations are not uniform across all regions, we have also crafted a correlation-aware adaptive strategy that prioritizes regions that present greater challenges. During SPFEM model training, we construct the spatial-temporal coherent correlation metric between corresponding local regions of the input and output image frames as an additional loss to supervise the generation process.

CV · Jan 30
Neural Clothing Tryer: Customized Virtual Try-On via Semantic Enhancement and Controlling Diffusion Model

Zhijing Yang, Weiwei Zhang, Mingliang Yang et al.

This work aims to address a novel Customized Virtual Try-ON (Cu-VTON) task, enabling the superimposition of a specified garment onto a model that can be customized in terms of appearance, posture, and additional attributes. Compared with the traditional VTON task, it enables users to tailor digital avatars to their individual preferences, thereby enhancing the virtual fitting experience with greater flexibility and engagement. To address this task, we introduce a Neural Clothing Tryer (NCT) framework, which exploits advanced diffusion models equipped with semantic enhancement and controlling modules to better preserve semantic characterization and textural details of the garment while facilitating the flexible editing of the model's postures and appearances. Specifically, NCT introduces a semantic-enhanced module to take semantic descriptions of garments and utilizes a visual-language encoder to learn aligned features across modalities. The aligned features serve as conditional input to the diffusion model to enhance the preservation of the garment's semantics. Then, a semantic controlling module is designed to take the garment image, tailored posture image, and semantic description as input to maintain garment details while simultaneously editing model postures, expressions, and various attributes. Extensive experiments on the openly available benchmark demonstrate the superior performance of the proposed NCT framework.

CV · Nov 9, 2023
Active Mining Sample Pair Semantics for Image-text Matching

Yongfeng Chen, Jin Liu, Zhijing Yang et al.

Recently, commonsense learning has been a hot topic in image-text matching. Although it can describe more graphic correlations, commonsense learning still has some shortcomings: 1) The existing methods are based on triplet semantic similarity measurement loss, which cannot effectively match the intractable negative in image-text sample pairs. 2) The weak generalization ability of the model leads to the poor effect of image and text matching on large-scale datasets. To address these shortcomings, this paper proposes a novel image-text matching model, called Active Mining Sample Pair Semantics image-text matching model (AMSPS). Compared with the single semantic learning mode of the commonsense learning model with triplet loss function, AMSPS follows an active learning approach. Firstly, the proposed Adaptive Hierarchical Reinforcement Loss (AHRL) has diversified learning modes. Its active learning mode enables the model to focus more on the intractable negative samples to enhance the discriminating ability. In addition, AMSPS can also adaptively mine more hidden relevant semantic representations from uncommented items, which greatly improves the performance and generalization ability of the model. Experimental results on Flickr30K and MSCOCO universal datasets show that our proposed method is superior to advanced comparison methods.

CV · Feb 18, 2025 · Code
GVTNet: Graph Vision Transformer For Face Super-Resolution

Chao Yang, Yong Fan, Cheng Lu et al.

Recent advances in face super-resolution research have utilized the Transformer architecture. This method processes the input image into a series of small patches. However, because of the strong correlation between different facial components in facial images, existing algorithms cannot handle the relationships between patches well when super-resolving low-resolution images, resulting in distorted facial components in the super-resolution results. To solve the problem, we propose a transformer architecture based on graph neural networks called graph vision transformer network. We treat each patch as a graph node and establish an adjacency matrix based on the information between patches. In this way, the patch only interacts between neighboring patches, further processing the relationship of facial components. Quantitative and visualization experiments have underscored the superiority of our algorithm over state-of-the-art techniques. Through detailed comparisons, we have demonstrated that our algorithm possesses more advanced super-resolution capabilities, particularly in enhancing facial components. The PyTorch code is available at https://github.com/continueyang/GVTNet
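
Treating patches as graph nodes and restricting attention to neighbors can be sketched as below. The adjacency rule (top-k cosine similarity plus a self-loop) and all function names are assumptions for illustration; the abstract only says the adjacency matrix is built from information between patches, and the linked repository is the authoritative implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def knn_adjacency(feats, k):
    """Connect each patch to itself and its k most similar patches."""
    n = len(feats)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        sims = sorted(
            ((cosine(feats[i], feats[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        adj[i][i] = 1
        for _, j in sims[:k]:
            adj[i][j] = 1
    return adj


def graph_attention(feats, adj):
    """Scaled dot-product attention where each node only sees its neighbors."""
    dim = len(feats[0])
    out = []
    for i, query in enumerate(feats):
        nbrs = [j for j in range(len(feats)) if adj[i][j]]
        logits = [dot(query, feats[j]) / math.sqrt(dim) for j in nbrs]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(x - peak) for x in logits]
        z = sum(exps)
        out.append([
            sum(e / z * feats[j][c] for e, j in zip(exps, nbrs))
            for c in range(dim)
        ])
    return out


# Three toy patch features; patches 0 and 1 are similar, patch 2 is not.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
adjacency = knn_adjacency(patches, k=1)
updated = graph_attention(patches, adjacency)
```

In this toy case patch 0 attends only to itself and patch 1, so its updated feature is a convex combination of those two and is unaffected by the dissimilar patch 2.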

CV · Feb 18, 2025 · Code
DeltaDiff: Reality-Driven Diffusion with Anchor Residuals for Faithful SR

Chao Yang, Yong Fan, Qichao Zhang et al.

Recently, the transfer application of diffusion models in super-resolution tasks has faced the problem of decreased fidelity. Due to the inherent random sampling characteristics of diffusion models, direct application in super-resolution tasks can result in generated details deviating from the true distribution of high-resolution images. To address this, we propose DeltaDiff, a novel framework that constrains the diffusion process, its essence is to establish a deterministic mapping path between HR and LR, rather than the random noise disturbance process of traditional diffusion models. Theoretical analysis demonstrates a 25% reduction in diffusion entropy in the residual space compared to pixel-space diffusion, effectively suppressing irrelevant noise interference. The experimental results show that our method surpasses state-of-the-art models and generates results with better fidelity. This work establishes a new low-rank constrained paradigm for applying diffusion models to image reconstruction tasks, balancing stochastic generation with structural fidelity. Our code and model are publicly available at https://github.com/continueyang/DeltaDiff .

CV · Apr 8, 2025
Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation

Tianshui Chen, Jianman Lin, Zhijing Yang et al.

Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion while preserving the mouth animation of source spoken contents. Thus, emotion and content information existing in reference and source inputs can provide direct and accurate supervision signals for SPFEM models. However, the intrinsic intertwining of these elements during the talking process poses challenges to their effectiveness as supervisory signals. In this work, we propose to learn content and emotion priors as guidance augmented with contrastive learning to learn decoupled content and emotion representation via an innovative Contrastive Decoupled Representation Learning (CDRL) algorithm. Specifically, a Contrastive Content Representation Learning (CCRL) module is designed to learn audio feature, which primarily contains content information, as content priors to guide learning content representation from the source input. Meanwhile, a Contrastive Emotion Representation Learning (CERL) module is proposed to make use of a pre-trained visual-language model to learn emotion prior, which is then used to guide learning emotion representation from the reference input. We further introduce emotion-aware and emotion-augmented contrastive learning to train CCRL and CERL modules, respectively, ensuring learning emotion-independent content representation and content-independent emotion representation. During SPFEM model training, the decoupled content and emotion representations are used to supervise the generation process, ensuring more accurate emotion manipulation together with audio-lip synchronization. Extensive experiments and evaluations on various benchmarks show the effectiveness of the proposed algorithm.

CL · Jan 3, 2025
Adaptive Few-shot Prompting for Machine Translation with Pre-trained Language Models

Lei Tang, Jinghui Qin, Wenxuan Ye et al.

Recently, Large language models (LLMs) with in-context learning have demonstrated remarkable potential in handling neural machine translation. However, existing evidence shows that LLMs are prompt-sensitive and it is sub-optimal to apply the fixed prompt to any input for downstream machine translation tasks. To address this issue, we propose an adaptive few-shot prompting (AFSP) framework to automatically select suitable translation demonstrations for various source input sentences to further elicit the translation capability of an LLM for better machine translation. First, we build a translation demonstration retrieval module based on LLM's embedding to retrieve top-k semantic-similar translation demonstrations from aligned parallel translation corpus. Rather than using other embedding models for semantic demonstration retrieval, we build a hybrid demonstration retrieval module based on the embedding layer of the deployed LLM to build better input representation for retrieving more semantic-related translation demonstrations. Then, to ensure better semantic consistency between source inputs and target outputs, we force the deployed LLM itself to generate multiple output candidates in the target language with the help of translation demonstrations and rerank these candidates. Besides, to better evaluate the effectiveness of our AFSP framework on the latest language and extend the research boundary of neural machine translation, we construct a high-quality diplomatic Chinese-English parallel dataset that consists of 5,528 parallel Chinese-English sentences. Finally, extensive experiments on the proposed diplomatic Chinese-English parallel dataset and the United Nations Parallel Corpus (Chinese-English part) show the effectiveness and superiority of our proposed AFSP.
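
The retrieval step (select the top-k most similar translation demonstrations for each source sentence, then place them in the prompt) can be sketched as follows. The toy embedding vectors, the prompt template, and the function names are assumptions for illustration; in the described framework the embeddings come from the deployed LLM, and the candidate generation and reranking stages are not shown.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_demonstrations(query_vec, corpus, k):
    """Return the k corpus entries whose embeddings are closest to the query."""
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(source, demos):
    """Format retrieved pairs as few-shot examples followed by the new source."""
    blocks = [f"Chinese: {d['zh']}\nEnglish: {d['en']}" for d in demos]
    blocks.append(f"Chinese: {source}\nEnglish:")
    return "\n\n".join(blocks)


# Hypothetical parallel corpus with made-up 2D embeddings.
corpus = [
    {"zh": "你好", "en": "Hello", "vec": [1.0, 0.0]},
    {"zh": "谢谢", "en": "Thank you", "vec": [0.0, 1.0]},
    {"zh": "早上好", "en": "Good morning", "vec": [0.9, 0.2]},
]
demos = retrieve_demonstrations([1.0, 0.1], corpus, k=2)
prompt = build_prompt("晚上好", demos)
```

Because the demonstrations are chosen per input, two different source sentences will generally receive different few-shot prompts, which is the contrast with a fixed prompt that the abstract draws.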

CVDec 14, 2024
Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels

Haoxian Ruan, Zhihua Xu, Zhijing Yang et al.

Multi-label recognition with partial labels (MLR-PL), in which only some labels are known while others are unknown for each image, is a practical task in computer vision, since collecting large-scale and complete multi-label datasets is difficult in real application scenarios. Recently, vision language models (e.g., CLIP) have demonstrated impressive transferability to downstream tasks in data-limited or label-limited settings. However, current CLIP-based methods suffer from semantic confusion in the MLR task due to the lack of fine-grained information in the single global visual and textual representation for all categories. In this work, we address this problem by introducing a semantic decoupling module and a category-specific prompt optimization method in a CLIP-based framework. Specifically, the semantic decoupling module following the visual encoder learns category-specific feature maps by utilizing the semantic-guided spatial attention mechanism. Moreover, the category-specific prompt optimization method is introduced to learn text representations aligned with category semantics. Therefore, the prediction of each category is independent, which alleviates the semantic confusion problem. Extensive experiments on Microsoft COCO 2014 and Pascal VOC 2007 datasets demonstrate that the proposed framework significantly outperforms current state-of-the-art methods with a simpler model structure. Additionally, visual analysis shows that our method effectively separates information from different categories and achieves better performance than the CLIP-based baseline method.

CVMay 27, 2025
Geometry-Editable and Appearance-Preserving Object Compositon

Jianman Lin, Haojie Li, Chunmei Qing et al.

General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties, while simultaneously preserving its fine-grained appearance details. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. However, these highly compact embeddings encode only high-level semantic cues and inevitably discard fine-grained appearance details. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion (DGAD) model that first leverages semantic embeddings to implicitly capture the desired geometric transformations and then employs a cross-attention retrieval mechanism to align fine-grained appearance features with the geometry-edited representation, facilitating both precise geometry editing and faithful appearance preservation in object composition. Specifically, DGAD builds on CLIP/DINO-derived and reference networks to extract semantic embeddings and appearance-preserving representations, which are then seamlessly integrated into the encoding and decoding pipelines in a disentangled manner. We first integrate the semantic embeddings into pre-trained diffusion models that exhibit strong spatial reasoning capabilities to implicitly capture object geometry, thereby facilitating flexible object manipulation and ensuring effective editability. Then, we design a dense cross-attention mechanism that leverages the implicitly learned object geometry to retrieve and spatially align appearance features with their corresponding regions, ensuring faithful appearance consistency. Extensive experiments on public benchmarks demonstrate the effectiveness of the proposed DGAD framework.

CVApr 8, 2025
Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation

Zhihua Xu, Tianshui Chen, Zhijing Yang et al.

The paramount challenge in audio-driven One-shot Talking Head Animation (ADOS-THA) lies in capturing subtle imperceptible changes between adjacent video frames. Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames, offering supplementary information that can be pivotal for guiding and supervising talking head animations. In this work, we propose to learn audio-visual correlations and integrate the correlations to help enhance feature representation and regularize final generation by a novel Temporal Audio-Visual Correlation Embedding (TAVCE) framework. Specifically, it first learns an audio-visual temporal correlation metric, ensuring the temporal audio relationships of adjacent clips are aligned with the temporal visual relationships of corresponding adjacent video frames. Since the temporal audio relationship contains aligned information about the visual frame, we first integrate it to guide learning more representative features via a simple yet effective channel attention mechanism. During training, we also use the alignment correlations as an additional objective to supervise generating visual frames. We conduct extensive experiments on several publicly available benchmarks (i.e., HDTF, LRW, VoxCeleb1, and VoxCeleb2) to demonstrate its superiority over existing leading algorithms.

CVJan 19
Exploring Talking Head Models With Adjacent Frame Prior for Speech-Preserving Facial Expression Manipulation

Zhenxuan Lu, Zhihua Xu, Zhijing Yang et al.

Speech-Preserving Facial Expression Manipulation (SPFEM) is an innovative technique aimed at altering facial expressions in images and videos while retaining the original mouth movements. Despite advancements, SPFEM still struggles with accurate lip synchronization due to the complex interplay between facial expressions and mouth shapes. Capitalizing on the advanced capabilities of audio-driven talking head generation (AD-THG) models in synthesizing precise lip movements, our research introduces a novel integration of these models with SPFEM. We present a new framework, Talking Head Facial Expression Manipulation (THFEM), which utilizes AD-THG models to generate frames with accurately synchronized lip movements from audio inputs and SPFEM-altered images. However, increasing the number of frames generated by AD-THG models tends to compromise the realism and expression fidelity of the images. To counter this, we develop an adjacent frame learning strategy that finetunes AD-THG models to predict sequences of consecutive frames. This strategy enables the models to incorporate information from neighboring frames, significantly improving image quality during testing. Our extensive experimental evaluations demonstrate that this framework effectively preserves mouth shapes during expression manipulations, highlighting the substantial benefits of integrating AD-THG with SPFEM.

CVSep 1, 2025
Neural Scene Designer: Self-Styled Semantic Image Manipulation

Jianman Lin, Tianshui Chen, Chunmei Qing et al.

Maintaining stylistic consistency is crucial for the cohesion and aesthetic appeal of images, a fundamental requirement in effective image editing and inpainting. However, existing methods primarily focus on the semantic control of generated content, often neglecting the critical task of preserving this consistency. In this work, we introduce the Neural Scene Designer (NSD), a novel framework that enables photo-realistic manipulation of user-specified scene regions while ensuring both semantic alignment with user intent and stylistic consistency with the surrounding environment. NSD leverages an advanced diffusion model, incorporating two parallel cross-attention mechanisms that separately process text and style information to achieve the dual objectives of semantic control and style consistency. To capture fine-grained style representations, we propose the Progressive Self-style Representational Learning (PSRL) module. This module is predicated on the intuitive premise that different regions within a single image share a consistent style, whereas regions from different images exhibit distinct styles. The PSRL module employs a style contrastive loss that encourages high similarity between representations from the same image while enforcing dissimilarity between those from different images. Furthermore, to address the lack of standardized evaluation protocols for this task, we establish a comprehensive benchmark. This benchmark includes competing algorithms, dedicated style-related metrics, and diverse datasets and settings to facilitate fair comparisons. Extensive experiments conducted on our benchmark demonstrate the effectiveness of the proposed framework.
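The style contrastive objective mentioned in this abstract can be sketched as follows. This is an InfoNCE-style formulation chosen for illustration; the abstract does not give the exact loss, so the function name, the temperature value, and the toy two-dimensional region embeddings are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def style_contrastive_loss(embeddings, image_ids, temperature=0.1):
    """Average InfoNCE-style loss over all (anchor, positive) region pairs.

    Regions from the same image are positives; regions from other images are negatives.
    """
    n = len(embeddings)
    losses = []
    for i in range(n):
        # Exponentiated, temperature-scaled similarity from anchor i to every other region.
        exp_sims = {
            j: math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
            for j in range(n)
            if j != i
        }
        denom = sum(exp_sims.values())
        for j, value in exp_sims.items():
            if image_ids[j] == image_ids[i]:
                losses.append(-math.log(value / denom))
    return sum(losses) / len(losses) if losses else 0.0


# Four region embeddings; the first two and the last two are nearly parallel.
regions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
consistent = style_contrastive_loss(regions, [0, 0, 1, 1])  # similar regions share an image
mixed = style_contrastive_loss(regions, [0, 1, 0, 1])       # similar regions split across images
```

The loss is small when regions from the same image already have similar representations and large when they do not, which is the behavior the premise of a shared per-image style calls for.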

CVJul 31, 2025
Learning Semantic-Aware Threshold for Multi-Label Image Recognition with Partial Labels

Haoxian Ruan, Zhihua Xu, Zhijing Yang et al.

Multi-label image recognition with partial labels (MLR-PL) is designed to train models using a mix of known and unknown labels. Traditional methods rely on semantic or feature correlations to create pseudo-labels for unidentified labels using pre-set thresholds. This approach often overlooks the varying score distributions across categories, resulting in inaccurate and incomplete pseudo-labels, thereby affecting performance. In our study, we introduce the Semantic-Aware Threshold Learning (SATL) algorithm. This innovative approach calculates the score distribution for both positive and negative samples within each category and determines category-specific thresholds based on these distributions. These distributions and thresholds are dynamically updated throughout the learning process. Additionally, we implement a differential ranking loss to establish a significant gap between the score distributions of positive and negative samples, enhancing the discrimination of the thresholds. Comprehensive experiments and analysis on large-scale multi-label datasets, such as Microsoft COCO and VG-200, demonstrate that our method significantly improves performance in scenarios with limited labels.
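To make the idea of category-specific thresholds concrete, here is a minimal sketch. It assumes the threshold for each category is the midpoint between the mean scores of its known positive and known negative samples; that rule, the fallback value, and all names are illustrative assumptions, not the SATL formulation, which also updates the distributions dynamically and adds a differential ranking loss.

```python
from statistics import mean


def category_thresholds(scores, labels, fallback=0.5):
    """Compute one threshold per category from known positives and negatives.

    scores[i][c] is the predicted score of sample i for category c.
    labels[i][c] is 1 (positive), 0 (negative), or None (unknown).
    """
    num_classes = len(scores[0])
    thresholds = []
    for c in range(num_classes):
        pos = [s[c] for s, l in zip(scores, labels) if l[c] == 1]
        neg = [s[c] for s, l in zip(scores, labels) if l[c] == 0]
        if pos and neg:
            # Assumed rule: midpoint between the positive and negative mean scores.
            thresholds.append((mean(pos) + mean(neg)) / 2)
        else:
            thresholds.append(fallback)
    return thresholds


def pseudo_label(scores, labels, thresholds):
    """Fill unknown labels by comparing scores with the per-category thresholds."""
    filled = []
    for s, l in zip(scores, labels):
        row = []
        for c, known in enumerate(l):
            if known is None:
                row.append(1 if s[c] >= thresholds[c] else 0)
            else:
                row.append(known)
        filled.append(row)
    return filled


scores = [[0.9, 0.3], [0.2, 0.1], [0.6, 0.25]]
labels = [[1, 1], [0, 0], [None, None]]
thresholds = category_thresholds(scores, labels)
filled = pseudo_label(scores, labels, thresholds)
```

In this toy data the second category scores low overall, so a single pre-set threshold of 0.5 would mark the third sample negative for it, while the category-specific threshold marks it positive.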

CVMay 24, 2023
NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-World Video Super-Resolution

Yexing Song, Meilin Wang, Zhijing Yang et al.

The capability of video super-resolution (VSR) to synthesize high-resolution (HR) video from ideal datasets has been demonstrated in many works. However, applying the VSR model to real-world video with unknown and complex degradation remains a challenging task. First, existing degradation metrics in most VSR methods are not able to effectively simulate real-world noise and blur. Instead, simple combinations of classical degradations are used for real-world noise modeling, which often leaves the VSR model vulnerable to out-of-distribution noise. Second, many SR models focus on noise simulation and transfer. Nevertheless, the sampled noise is monotonous and limited. To address the aforementioned problems, we propose a Negatives augmentation strategy for generalized noise modeling in the Video Super-Resolution (NegVSR) task. Specifically, we first propose sequential noise generation toward real-world data to extract practical noise sequences. Then, the degeneration domain is widely expanded by negative augmentation to build up various yet challenging real-world noise sets. We further propose the augmented negative guidance loss to learn robust features among augmented negatives effectively. Extensive experiments on real-world datasets (e.g., VideoLQ and FLIR) show that our method outperforms state-of-the-art methods with clear margins, especially in visual quality. Project page is available at: https://negvsr.github.io/.

LGMay 19, 2023
Justices for Information Bottleneck Theory

Faxian Cao, Yongqiang Cheng, Adil Mehmood Khan et al.

This study comes as a timely response to mounting criticism of the information bottleneck (IB) theory, injecting fresh perspectives to rectify misconceptions and reaffirm its validity. Firstly, we introduce an auxiliary function to reinterpret the maximal coding rate reduction method as a special yet local optimal case of IB theory. Through this auxiliary function, we clarify the paradox of decreasing mutual information during the application of ReLU activation in deep learning (DL) networks. Secondly, we challenge the doubts about IB theory's applicability by demonstrating its capacity to explain the absence of a compression phase with linear activation functions in hidden layers, when viewed through the lens of the auxiliary function. Lastly, by taking a novel theoretical stance, we provide a new way to interpret the inner organizations of DL networks by using IB theory, aligning them with recent experimental evidence. Thus, this paper serves as an act of justice for IB theory, potentially reinvigorating its standing and application in DL and other fields such as communications and biomedical research.

CVSep 25, 2020
AIM 2020 Challenge on Real Image Super-Resolution: Methods and Results

Pengxu Wei, Hannan Lu, Radu Timofte et al.

This paper introduces the real image Super-Resolution (SR) challenge that was part of the Advances in Image Manipulation (AIM) workshop, held in conjunction with ECCV 2020. This challenge involves three tracks to super-resolve an input image for $\times$2, $\times$3 and $\times$4 scaling factors, respectively. The goal is to attract more attention to realistic image degradation for the SR task, which is much more complicated and challenging, and contributes to real-world image super-resolution applications. 452 participants were registered for three tracks in total, and 24 teams submitted their results. They gauge the state-of-the-art approaches for real image SR in terms of PSNR and SSIM.

LGAug 23, 2020
Unsupervised Multi-view Clustering by Squeezing Hybrid Knowledge from Cross View and Each View

Junpeng Tan, Yukai Shi, Zhijing Yang et al.

Multi-view clustering methods have been a focus in recent years because of their superiority in clustering performance. However, typical traditional multi-view clustering algorithms still have shortcomings in some aspects, such as removal of redundant information, utilization of various views and fusion of multi-view features. In view of these problems, this paper proposes a new multi-view clustering method, low-rank subspace multi-view clustering based on adaptive graph regularization. We construct two new data matrix decomposition models into a unified optimization model. In this framework, we address the significance of the common knowledge shared by the cross view and the unique knowledge of each view by presenting new low-rank and sparse constraints on the sparse subspace matrix. To ensure that we achieve effective sparse representation and clustering performance on the original data matrix, adaptive graph regularization and unsupervised clustering constraints are also incorporated in the proposed model to preserve the internal structural features of the data. Finally, the proposed method is compared with several state-of-the-art algorithms. Experimental results for five widely used multi-view benchmarks show that our proposed algorithm surpasses other state-of-the-art methods by a clear margin.

IVMay 3, 2020
NTIRE 2020 Challenge on Perceptual Extreme Super-Resolution: Methods and Results

Kai Zhang, Shuhang Gu, Radu Timofte et al.

This paper reviews the NTIRE 2020 challenge on perceptual extreme super-resolution with focus on proposed solutions and results. The challenge task was to super-resolve an input image with a magnification factor of 16 based on a set of prior examples of low and corresponding high resolution images. The goal is to obtain a network design capable of producing high resolution results with the best perceptual quality and similar to the ground truth. The track had 280 registered participants, and 19 teams submitted the final results. They gauge the state-of-the-art in single image super-resolution.

CVFeb 25, 2020
DDet: Dual-path Dynamic Enhancement Network for Real-World Image Super-Resolution

Yukai Shi, Haoyu Zhong, Zhijing Yang et al.

Different from the traditional image super-resolution task, real image super-resolution (Real-SR) focuses on the relationship between real-world high-resolution (HR) and low-resolution (LR) images. Most traditional image SR methods obtain the LR sample by applying a fixed down-sampling operator. Real-SR obtains the LR and HR image pair by incorporating optical sensors of different quality. Generally, Real-SR has more challenges as well as broader application scenarios. Previous image SR methods fail to exhibit similar performance on Real-SR as the image data is not aligned inherently. In this article, we propose a Dual-path Dynamic Enhancement Network (DDet) for Real-SR, which addresses the cross-camera image mapping by realizing a dual-way dynamic sub-pixel weighted aggregation and refinement. Unlike conventional methods which stack up massive convolutional blocks for feature representation, we introduce a content-aware framework to study non-inherently aligned image pairs in the image SR problem. First, we use a content-adaptive component to exhibit the Multi-scale Dynamic Attention (MDA). Second, we incorporate a long-term skip connection with a Coupled Detail Manipulation (CDM) to perform collaborative compensation and manipulation. The two paths are joined into a unified model and work collaboratively. Extensive experiments on the challenging benchmarks demonstrate the superiority of our model.

LGMar 19, 2019
POP-CNN: Predicting Odor's Pleasantness with Convolutional Neural Network

Danli Wu, Yu Cheng, Dehan Luo et al.

Predicting odor's pleasantness simplifies the evaluation of odors and has the potential to be applied in the perfume and environmental monitoring industries. Classical algorithms for predicting odor's pleasantness generally use a manual feature extractor and an independent classifier. Manually designing a good feature extractor depends on expert knowledge and experience, and is the key to the accuracy of these algorithms. To circumvent this difficulty, we propose a model for predicting odor's pleasantness using a convolutional neural network. In our model, the convolutional layers replace the manual feature extractor and show better performance. The experiments show that the correlation between our model and humans on pleasantness rating is over 90%. Our model also has 99.9% accuracy in distinguishing between absolutely pleasant and unpleasant odors.

NENov 22, 2018
Conditioning Optimization of Extreme Learning Machine by Multitask Beetle Antennae Swarm Algorithm

Xixian Zhang, Zhijing Yang, Faxian Cao et al.

Extreme learning machine (ELM), a simple and rapid neural network, has shown good performance in various areas. Different from the general single hidden layer feedforward neural network (SLFN), the input weights and biases in the hidden layer of ELM are generated randomly, so that it takes only a little computation overhead to train the model. However, the strategy of selecting input weights and biases at random may result in an ill-posed problem. Aiming to optimize the conditioning of ELM, we propose an effective particle swarm heuristic algorithm called Multitask Beetle Antennae Swarm Algorithm (MBAS), which is inspired by the structures of artificial bee colony (ABS) algorithm and Beetle Antennae Search (BAS) algorithm. Then, the proposed MBAS is applied to optimize the input weights and biases of ELM. Experimental results show that the proposed method is capable of simultaneously reducing the condition number and regression error, and achieving good generalization performance.

CVOct 9, 2017
Does Normalization Methods Play a Role for Hyperspectral Image Classification?

Faxian Cao, Zhijing Yang, Jinchang Ren et al.

For hyperspectral image (HSI) datasets, each class has its own salient features, and classifiers classify HSI datasets according to these features. However, the salient features differ when different normalization methods are used. In this letter, we report the effect of different normalization methods on classifiers and, after analyzing that impact, recommend the best normalization methods for classifiers. The Pavia University, Indian Pines and Kennedy Space Center datasets are applied to several typical classifiers in order to evaluate and analyze the impact of different normalization methods on typical classifiers.
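As a small illustration of why the choice of normalization matters, the sketch below applies two common schemes to the values of a single spectral band. The abstract does not name the methods that were compared, so min-max scaling and z-score standardization are assumptions chosen as familiar examples, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def min_max(values):
    """Scale values linearly into [0, 1]; a constant band maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


band = [2.0, 4.0, 6.0]  # toy reflectance values for one band
scaled = min_max(band)
standardized = z_score(band)
```

The two outputs place the same pixels on different scales, so a classifier that depends on distances or margins can weight the band differently under each scheme.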

CVSep 12, 2017
Sparse Representation Based Augmented Multinomial Logistic Extreme Learning Machine with Weighted Composite Features for Spectral Spatial Hyperspectral Image Classification

Faxian Cao, Zhijing Yang, Jinchang Ren et al.

Although extreme learning machine (ELM) has been successfully applied to a number of pattern recognition problems, it fails to provide sufficiently good results in hyperspectral image (HSI) classification due to two main drawbacks. The first is due to the random weights and bias of ELM, which may lead to ill-posed problems. The second is the lack of spatial information for classification. To tackle these two problems, in this paper, we propose a new framework for ELM based spectral-spatial classification of HSI, where probabilistic modelling with sparse representation and weighted composite features (WCF) are employed respectively to derive the optimized output weights and extract spatial features. First, the ELM is represented as a concave logarithmic likelihood function under statistical modelling using the maximum a posteriori (MAP). Second, the sparse representation is applied to the Laplacian prior to efficiently determine a logarithmic posterior with a unique maximum in order to solve the ill-posed problem of ELM. The variable splitting and the augmented Lagrangian are subsequently used to further reduce the computation complexity of the proposed algorithm and it has been proven a more efficient method for speed improvement. Third, the spatial information is extracted using the weighted composite features (WCFs) to construct the spectral-spatial classification framework. In addition, the lower bound of the proposed method is derived by a rigorous mathematical proof. Experimental results on two publicly available HSI data sets demonstrate that the proposed methodology outperforms ELM and a number of state-of-the-art approaches.

CVSep 8, 2017
Extreme Sparse Multinomial Logistic Regression: A Fast and Robust Framework for Hyperspectral Image Classification

Faxian Cao, Zhijing Yang, Jinchang Ren et al.

Although the sparse multinomial logistic regression (SMLR) has provided a useful tool for sparse classification, it suffers from inefficacy in dealing with high dimensional features and manually set initial regressor values. This has significantly constrained its applications for hyperspectral image (HSI) classification. In order to tackle these two drawbacks, an extreme sparse multinomial logistic regression (ESMLR) is proposed for effective classification of HSI. First, the HSI dataset is projected to a new feature space with randomly generated weight and bias. Second, an optimization model is established by the Lagrange multiplier method and the dual principle to automatically determine a good initial regressor for SMLR via minimizing the training error and the regressor value. Furthermore, the extended multi-attribute profiles (EMAPs) are utilized for extracting both the spectral and spatial features. A combinational linear multiple features learning (MFL) method is proposed to further enhance the features extracted by ESMLR and EMAPs. Finally, the logistic regression via the variable splitting and the augmented Lagrangian (LORSAL) is adopted in the proposed framework for reducing the computational time. Experiments are conducted on two well-known HSI datasets, namely the Indian Pines dataset and the Pavia University dataset, which have shown the fast and robust performance of the proposed ESMLR framework.

CVSep 5, 2017
Linear vs Nonlinear Extreme Learning Machine for Spectral-Spatial Classification of Hyperspectral Image

Faxian Cao, Zhijing Yang, Jinchang Ren et al.

As a new machine learning approach, extreme learning machine (ELM) has received wide attention due to its good performance. However, when directly applied to hyperspectral image (HSI) classification, the recognition rate is too low. This is because ELM does not use the spatial information, which is very important for HSI classification. In view of this, this paper proposes a new framework for spectral-spatial classification of HSI by combining ELM with loopy belief propagation (LBP). The original ELM is linear, and the nonlinear ELMs (or Kernel ELMs) are the improvement of linear ELM (LELM). However, based on extensive experiments and analysis, we found that the LELM is a better choice than nonlinear ELM for spectral-spatial classification of HSI. Furthermore, we exploit the marginal probability distribution that uses the whole information in the HSI and learn such distribution using the LBP. The proposed method not only maintains the fast speed of ELM, but also greatly improves the accuracy of classification. The experimental results on the well-known HSI data sets, Indian Pines and Pavia University, demonstrate the good performance of the proposed method.