Yuanyuan Liu

CV
h-index34
60papers
1,549citations
Novelty56%
AI Score59

60 Papers

CVNov 24, 2022Code
Pose-disentangled Contrastive Learning for Self-supervised Facial Representation

Yuanyuan Liu, Wenbin Wang, Yibing Zhan et al.

Self-supervised facial representation has recently attracted increasing attention due to its ability to perform face understanding without relying on large-scale annotated datasets heavily. However, analytically, current contrastive-based self-supervised learning (SSL) still performs unsatisfactorily for learning facial representation. More specifically, existing contrastive learning (CL) tends to learn pose-invariant features that cannot depict the pose details of faces, compromising the learning performance. To conquer the above limitation of CL, we propose a novel Pose-disentangled Contrastive Learning (PCL) method for general self-supervised facial representation. Our PCL first devises a pose-disentangled decoder (PDD) with a delicately designed orthogonalizing regulation, which disentangles the pose-related features from the face-aware features; therefore, pose-related and other pose-unrelated facial information could be performed in individual subnetworks and do not affect each other's training. Furthermore, we introduce a pose-related contrastive learning scheme that learns pose-related information based on data augmentation of the same image, which would deliver more effective face-aware representation for various downstream tasks. We conducted linear evaluation on four challenging downstream facial understanding tasks, ie, facial expression recognition, face recognition, AU detection and head pose estimation. Experimental results demonstrate that our method significantly outperforms state-of-the-art SSL methods. Code is available at https://github.com/DreamMr/PCL}{https://github.com/DreamMr/PCL

CVJun 8, 2023Code
Boosting Adversarial Transferability by Achieving Flat Local Maxima

Zhijin Ge, Hongying Liu, Xiaosen Wang et al.

Transfer-based attack adopts the adversarial examples generated on the surrogate model to attack various models, making it applicable in the physical world and attracting increasing interest. Recently, various adversarial attacks have emerged to boost adversarial transferability from different perspectives. In this work, inspired by the observation that flat local minima are correlated with good generalization, we assume and empirically validate that adversarial examples at a flat local region tend to have good transferability by introducing a penalized gradient norm to the original loss function. Since directly optimizing the gradient regularization norm is computationally expensive and intractable for generating adversarial examples, we propose an approximation optimization method to simplify the gradient update of the objective function. Specifically, we randomly sample an example and adopt a first-order procedure to approximate the curvature of Hessian/vector product, which makes computing more efficient by interpolating two neighboring gradients. Meanwhile, in order to obtain a more stable gradient direction, we randomly sample multiple examples and average the gradients of these examples to reduce the variance due to random sampling during the iterative process. Extensive experimental results on the ImageNet-compatible dataset show that the proposed method can generate adversarial examples at flat local regions, and significantly improve the adversarial transferability on either normally trained models or adversarially trained models than the state-of-the-art attacks. Our codes are available at: https://github.com/Trustworthy-AI-Group/PGN.

CVAug 21, 2023Code
Improving the Transferability of Adversarial Examples with Arbitrary Style Transfer

Zhijin Ge, Fanhua Shang, Hongying Liu et al.

Deep neural networks are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations on clean inputs. Although many attack methods can achieve high success rates in the white-box setting, they also exhibit weak transferability in the black-box setting. Recently, various methods have been proposed to improve adversarial transferability, in which the input transformation is one of the most effective methods. In this work, we notice that existing input transformation-based works mainly adopt the transformed data in the same domain for augmentation. Inspired by domain generalization, we aim to further improve the transferability using the data augmented from different domains. Specifically, a style transfer network can alter the distribution of low-level visual features in an image while preserving semantic content for humans. Hence, we propose a novel attack method named Style Transfer Method (STM) that utilizes a proposed arbitrary style transfer network to transform the images into different domains. To avoid inconsistent semantic information of stylized images for the classification network, we fine-tune the style transfer network and mix up the generated images added by random noise with the original images to maintain semantic consistency and boost input diversity. Extensive experimental results on the ImageNet-compatible dataset show that our proposed method can significantly improve the adversarial transferability on either normally trained models or adversarially trained models than state-of-the-art input transformation-based attacks. Code is available at: https://github.com/Zhijin-Ge/STM.

CVAug 1, 2022
MAFW: A Large-scale, Multi-modal, Compound Affective Database for Dynamic Facial Expression Recognition in the Wild

Yuanyuan Liu, Wei Dai, Chuanxu Feng et al.

Dynamic facial expression recognition (FER) databases provide important data support for affective computing and applications. However, most FER databases are annotated with several basic mutually exclusive emotional categories and contain only one modality, e.g., videos. The monotonous labels and modality cannot accurately imitate human emotions and fulfill applications in the real world. In this paper, we propose MAFW, a large-scale multi-modal compound affective database with 10,045 video-audio clips in the wild. Each clip is annotated with a compound emotional category and a couple of sentences that describe the subjects' affective behaviors in the clip. For the compound emotion annotation, each clip is categorized into one or more of the 11 widely-used emotions, i.e., anger, disgust, fear, happiness, neutral, sadness, surprise, contempt, anxiety, helplessness, and disappointment. To ensure high quality of the labels, we filter out the unreliable annotations by an Expectation Maximization (EM) algorithm, and then obtain 11 single-label emotion categories and 32 multi-label emotion categories. To the best of our knowledge, MAFW is the first in-the-wild multi-modal database annotated with compound emotion annotations and emotion-related captions. Additionally, we also propose a novel Transformer-based expression snippet feature learning method to recognize the compound emotions leveraging the expression-change relations among different emotions and modalities. Extensive experiments on MAFW database show the advantages of the proposed method over other state-of-the-art methods for both uni- and multi-modal FER. Our MAFW database is publicly available from https://mafw-database.github.io/MAFW.

LGFeb 22Code
Taming Preconditioner Drift: Unlocking the Potential of Second-Order Optimizers for Federated Learning on Non-IID Data

Junkang Liu, Fanhua Shang, Hongying Liu et al.

Second-order optimizers can significantly accelerate large-scale training, yet their naive federated variants are often unstable or even diverge on non-IID data. We show that a key culprit is \emph{preconditioner drift}: client-side second-order training induces heterogeneous \emph{curvature-defined geometries} (i.e., preconditioner coordinate systems), and server-side model averaging updates computed under incompatible metrics, corrupting the global descent direction. To address this geometric mismatch, we propose \texttt{FedPAC}, a \emph{preconditioner alignment and correction} framework for reliable federated second-order optimization. \texttt{FedPAC} explicitly decouples parameter aggregation from geometry synchronization by: (i) \textbf{Alignment} (i.e.,aggregating local preconditioners into a global reference and warm-starting clients via global preconditioner); and (ii) \textbf{Correction} (i.e., steering local preconditioned updates using a global preconditioned direction to suppress long-term drift). We provide drift-coupled non-convex convergence guarantees with linear speedup under partial participation. Empirically, \texttt{FedPAC} consistently improves stability and accuracy across vision and language tasks, achieving up to $5.8\%$ absolute accuracy gain on CIFAR-100 with ViTs. Code is available at https://anonymous.4open.science/r/FedPAC-8B24.

LGOct 31, 2025Code
DP-FedPGN: Finding Global Flat Minima for Differentially Private Federated Learning via Penalizing Gradient Norm

Junkang Liu, Yuxuan Tian, Fanhua Shang et al.

To prevent inference attacks in Federated Learning (FL) and reduce the leakage of sensitive information, Client-level Differentially Private Federated Learning (CL-DPFL) is widely used. However, current CL-DPFL methods usually result in sharper loss landscapes, which leads to a decrease in model generalization after differential privacy protection. By using Sharpness Aware Minimization (SAM), the current popular federated learning methods are to find a local flat minimum value to alleviate this problem. However, the local flatness may not reflect the global flatness in CL-DPFL. Therefore, to address this issue and seek global flat minima of models, we propose a new CL-DPFL algorithm, DP-FedPGN, in which we introduce a global gradient norm penalty to the local loss to find the global flat minimum. Moreover, by using our global gradient norm penalty, we not only find a flatter global minimum but also reduce the locally updated norm, which means that we further reduce the error of gradient clipping. From a theoretical perspective, we analyze how DP-FedPGN mitigates the performance degradation caused by DP. Meanwhile, the proposed DP-FedPGN algorithm eliminates the impact of data heterogeneity and achieves fast convergence. We also use Rényi DP to provide strict privacy guarantees and provide sensitivity analysis for local updates. Finally, we conduct effectiveness tests on both ResNet and Transformer models, and achieve significant improvements in six visual and natural language processing tasks compared to existing state-of-the-art algorithms. The code is available at https://github.com/junkangLiu0/DP-FedPGN

LGOct 31, 2025Code
FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models

Junkang Liu, Fanhua Shang, Kewen Zhu et al.

AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate $\boldsymbol{v}$; (2) the local overfitting of AdamW may cause client drift; and (3) Reinitializing moment estimates ($\boldsymbol{v}$, $\boldsymbol{m}$) at each round slows down convergence. To address these challenges, we propose the first \underline{Fed}erated \underline{AdamW} algorithm, called \texttt{FedAdamW}, for training and fine-tuning various large models. \texttt{FedAdamW} aligns local updates with the global update using both a \textbf{local correction mechanism} and decoupled weight decay to mitigate local overfitting. \texttt{FedAdamW} efficiently aggregates the \texttt{mean} of the second-moment estimates to reduce their variance and reinitialize them. Theoretically, we prove that \texttt{FedAdamW} achieves a linear speedup convergence rate of $\mathcal{O}(\sqrt{(L Δσ_l^2)/(S K R ε^2)}+(L Δ)/R)$ without \textbf{heterogeneity assumption}, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. We also employ PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of \texttt{FedAdamW} on language and vision Transformer models. Compared to several baselines, \texttt{FedAdamW} significantly reduces communication rounds and improves test accuracy. The code is available in https://github.com/junkangLiu0/FedAdamW.

CVOct 12, 2022
Explore Contextual Information for 3D Scene Graph Generation

Yuanyuan Liu, Chengjiang Long, Zhaoxuan Zhang et al.

3D scene graph generation (SGG) has been of high interest in computer vision. Although the accuracy of 3D SGG on coarse classification and single relation label has been gradually improved, the performance of existing works is still far from being perfect for fine-grained and multi-label situations. In this paper, we propose a framework fully exploring contextual information for the 3D SGG task, which attempts to satisfy the requirements of fine-grained entity class, multiple relation labels, and high accuracy simultaneously. Our proposed approach is composed of a Graph Feature Extraction module and a Graph Contextual Reasoning module, achieving appropriate information-redundancy feature extraction, structured organization, and hierarchical inferring. Our approach achieves superior or competitive performance over previous methods on the 3DSSG dataset, especially on the relationship prediction sub-task.

CVJun 29, 2023
A Fast Fourier Convolutional Deep Neural Network For Accurate and Explainable Discrimination Of Wheat Yellow Rust And Nitrogen Deficiency From Sentinel-2 Time-Series Data

Yue Shi, Liangxiu Han, Pablo González-Moreno et al.

Accurate and timely detection of plant stress is essential for yield protection, allowing better-targeted intervention strategies. Recent advances in remote sensing and deep learning have shown great potential for rapid non-invasive detection of plant stress in a fully automated and reproducible manner. However, the existing models always face several challenges: 1) computational inefficiency and the misclassifications between the different stresses with similar symptoms; and 2) the poor interpretability of the host-stress interaction. In this work, we propose a novel fast Fourier Convolutional Neural Network (FFDNN) for accurate and explainable detection of two plant stresses with similar symptoms (i.e. Wheat Yellow Rust And Nitrogen Deficiency). Specifically, unlike the existing CNN models, the main components of the proposed model include: 1) a fast Fourier convolutional block, a newly fast Fourier transformation kernel as the basic perception unit, to substitute the traditional convolutional kernel to capture both local and global responses to plant stress in various time-scale and improve computing efficiency with reduced learning parameters in Fourier domain; 2) Capsule Feature Encoder to encapsulate the extracted features into a series of vector features to represent part-to-whole relationship with the hierarchical structure of the host-stress interactions of the specific stress. In addition, in order to alleviate over-fitting, a photochemical vegetation indices-based filter is placed as pre-processing operator to remove the non-photochemical noises from the input Sentinel-2 time series.

CVJul 31, 2024
Deep Learning-Based Longitudinal Prediction of Childhood Myopia Progression Using Fundus Image Sequences and Baseline Refraction Data

Mengtian Kang, Yansong Hu, Shuo Gao et al.

Childhood myopia constitutes a significant global health concern. It exhibits an escalating prevalence and has the potential to evolve into severe, irreversible conditions that detrimentally impact familial well-being and create substantial economic costs. Contemporary research underscores the importance of precisely predicting myopia progression to enable timely and effective interventions, thereby averting severe visual impairment in children. Such predictions predominantly rely on subjective clinical assessments, which are inherently biased and resource-intensive, thus hindering their widespread application. In this study, we introduce a novel, high-accuracy method for quantitatively predicting the myopic trajectory and myopia risk in children using only fundus images and baseline refraction data. This approach was validated through a six-year longitudinal study of 3,408 children in Henan, utilizing 16,211 fundus images and corresponding refractive data. Our method based on deep learning demonstrated predictive accuracy with an error margin of 0.311D per year and AUC scores of 0.944 and 0.995 for forecasting the risks of developing myopia and high myopia, respectively. These findings confirm the utility of our model in supporting early intervention strategies and in significantly reducing healthcare costs, particularly by obviating the need for additional metadata and repeated consultations. Furthermore, our method was designed to rely only on fundus images and refractive error data, without the need for meta data or multiple inquiries from doctors, strongly reducing the associated medical costs and facilitating large-scale screening. Our model can even provide good predictions based on only a single time measurement. Consequently, the proposed method is an important means to reduce medical inequities caused by economic disparities.

AIAug 9, 2023
A Hierarchical Destroy and Repair Approach for Solving Very Large-Scale Travelling Salesman Problem

Zhang-Hua Fu, Sipeng Sun, Jintong Ren et al.

For prohibitively large-scale Travelling Salesman Problems (TSPs), existing algorithms face big challenges in terms of both computational efficiency and solution quality. To address this issue, we propose a hierarchical destroy-and-repair (HDR) approach, which attempts to improve an initial solution by applying a series of carefully designed destroy-and-repair operations. A key innovative concept is the hierarchical search framework, which recursively fixes partial edges and compresses the input instance into a small-scale TSP under some equivalence guarantee. This neat search framework is able to deliver highly competitive solutions within a reasonable time. Fair comparisons based on nineteen famous large-scale instances (with 10,000 to 10,000,000 cities) show that HDR is highly competitive against existing state-of-the-art TSP algorithms, in terms of both efficiency and solution quality. Notably, on two large instances with 3,162,278 and 10,000,000 cities, HDR breaks the world records (i.e., best-known results regardless of computation time), which were previously achieved by LKH and its variants, while HDR is completely independent of LKH. Finally, ablation studies are performed to certify the importance and validity of the hierarchical search framework.

CVSep 2, 2022
PCDNF: Revisiting Learning-based Point Cloud Denoising via Joint Normal Filtering

Zheng Liu, Yaowu Zhao, Sijing Zhan et al.

Recovering high quality surfaces from noisy point clouds, known as point cloud denoising, is a fundamental yet challenging problem in geometry processing. Most of the existing methods either directly denoise the noisy input or filter raw normals followed by updating point positions. Motivated by the essential interplay between point cloud denoising and normal filtering, we revisit point cloud denoising from a multitask perspective, and propose an end-to-end network, named PCDNF, to denoise point clouds via joint normal filtering. In particular, we introduce an auxiliary normal filtering task to help the overall network remove noise more effectively while preserving geometric features more accurately. In addition to the overall architecture, our network has two novel modules. On one hand, to improve noise removal performance, we design a shape-aware selector to construct the latent tangent space representation of the specific point by comprehensively considering the learned point and normal features and geometry priors. On the other hand, point features are more suitable for describing geometric details, and normal features are more conducive for representing geometric structures (e.g., sharp edges and corners). Combining point and normal features allows us to overcome their weaknesses. Thus, we design a feature refinement module to fuse point and normal features for better recovering geometric information. Extensive evaluations, comparisons, and ablation studies demonstrate that the proposed method outperforms state-of-the-arts for both point cloud denoising and normal filtering.

AIOct 9, 2023
Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis

Haoyu Zhang, Yu Wang, Guanghao Yin et al.

Though Multimodal Sentiment Analysis (MSA) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), the potential sentiment-irrelevant and conflicting information across modalities may hinder the performance from being further improved. To alleviate this, we present Adaptive Language-guided Multimodal Transformer (ALMT), which incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation from visual and audio features under the guidance of language features at different scales. With the obtained hyper-modality representation, the model can obtain a complementary and joint representation through multimodal fusion for effective MSA. In practice, ALMT achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI and CH-SIMS) and an abundance of ablation demonstrates the validity and necessity of our irrelevance/conflict suppression mechanism.

LGOct 31, 2025
FedMuon: Accelerating Federated Learning with Matrix Orthogonalization

Junkang Liu, Fanhua Shang, Junchao Zhou et al.

The core bottleneck of Federated Learning (FL) lies in the communication rounds. That is, how to achieve more effective local updates is crucial for reducing communication rounds. Existing FL methods still primarily use element-wise local optimizers (Adam/SGD), neglecting the geometric structure of the weight matrices. This often leads to the amplification of pathological directions in the weights during local updates, leading deterioration in the condition number and slow convergence. Therefore, we introduce the Muon optimizer in local, which has matrix orthogonalization to optimize matrix-structured parameters. Experimental results show that, in IID setting, Local Muon significantly accelerates the convergence of FL and reduces communication rounds compared to Local SGD and Local AdamW. However, in non-IID setting, independent matrix orthogonalization based on the local distributions of each client induces strong client drift. Applying Muon in non-IID FL poses significant challenges: (1) client preconditioner leading to client drift; (2) moment reinitialization. To address these challenges, we propose a novel Federated Muon optimizer (FedMuon), which incorporates two key techniques: (1) momentum aggregation, where clients use the aggregated momentum for local initialization; (2) local-global alignment, where the local gradients are aligned with the global update direction to significantly reduce client drift. Theoretically, we prove that \texttt{FedMuon} achieves a linear speedup convergence rate without the heterogeneity assumption, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. Empirically, we validate the effectiveness of FedMuon on language and vision models. Compared to several baselines, FedMuon significantly reduces communication rounds and improves test accuracy.

CVAug 30, 2023
Physics-Informed DeepMRI: Bridging the Gap from Heat Diffusion to k-Space Interpolation

Zhuo-Xu Cui, Congcong Liu, Xiaohong Fan et al.

In the field of parallel imaging (PI), alongside image-domain regularization methods, substantial research has been dedicated to exploring $k$-space interpolation. However, the interpretability of these methods remains an unresolved issue. Furthermore, these approaches currently face acceleration limitations that are comparable to those experienced by image-domain methods. In order to enhance interpretability and overcome the acceleration limitations, this paper introduces an interpretable framework that unifies both $k$-space interpolation techniques and image-domain methods, grounded in the physical principles of heat diffusion equations. Building upon this foundational framework, a novel $k$-space interpolation method is proposed. Specifically, we model the process of high-frequency information attenuation in $k$-space as a heat diffusion equation, while the effort to reconstruct high-frequency information from low-frequency regions can be conceptualized as a reverse heat equation. However, solving the reverse heat equation poses a challenging inverse problem. To tackle this challenge, we modify the heat equation to align with the principles of magnetic resonance PI physics and employ the score-based generative method to precisely execute the modified reverse heat diffusion. Finally, experimental validation conducted on publicly available datasets demonstrates the superiority of the proposed approach over traditional $k$-space interpolation methods, deep learning-based $k$-space interpolation methods, and conventional diffusion models in terms of reconstruction accuracy, particularly in high-frequency regions.

CVDec 18, 2025Code
Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors

Kejun Liu, Yuanyuan Liu, Lin Wei et al.

Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field heavily relies on facial expression recognition (FER) because visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, spontaneous emotion induction paradigm is exploited with stimulus material, during which non-invasive eye behavior data, like eye movement sequences and eye fixation maps, is captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for mutimodal ER and FER are separately annotated. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that the EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at https://github.com/kejun1/EMER.

LGMar 5Code
FedBCD:Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning

Junkang Liu, Fanhua Shang, Yuanyuan Liu et al.

Although Federated Learning has been widely studied in recent years, there are still high overhead expenses in each communication round for large-scale models such as Vision Transformer. To lower the communication complexity, we propose a novel Federated Block Coordinate Gradient Descent (FedBCGD) method for communication efficiency. The proposed method splits model parameters into several blocks, including a shared block and enables uploading a specific parameter block by each client, which can significantly reduce communication overhead. Moreover, we also develop an accelerated FedBCGD algorithm (called FedBCGD+) with client drift control and stochastic variance reduction. To the best of our knowledge, this paper is the first work on parameter block communication for training large-scale deep models. We also provide the convergence analysis for the proposed algorithms. Our theoretical results show that the communication complexities of our algorithms are a factor $1/N$ lower than those of existing methods, where $N$ is the number of parameter blocks, and they enjoy much faster convergence than their counterparts. Empirical results indicate the superiority of the proposed algorithms compared to state-of-the-art algorithms. The code is available at https://github.com/junkangLiu0/FedBCGD.

CVAug 11, 2022
K-UNN: k-Space Interpolation With Untrained Neural Network

Zhuo-Xu Cui, Sen Jia, Qingyong Zhu et al.

Recently, untrained neural networks (UNNs) have shown satisfactory performances for MR image reconstruction on random sampling trajectories without using additional full-sampled training data. However, the existing UNN-based approach does not fully use the MR image physical priors, resulting in poor performance in some common scenarios (e.g., partial Fourier, regular sampling, etc.) and the lack of theoretical guarantees for reconstruction accuracy. To bridge this gap, we propose a safeguarded k-space interpolation method for MRI using a specially designed UNN with a tripled architecture driven by three physical priors of the MR images (or k-space data), including sparsity, coil sensitivity smoothness, and phase smoothness. We also prove that the proposed method guarantees tight bounds for interpolated k-space data accuracy. Finally, ablation experiments show that the proposed method can more accurately characterize the physical priors of MR images than existing traditional methods. Additionally, under a series of commonly used sampling trajectories, experiments also show that the proposed method consistently outperforms traditional parallel imaging methods and existing UNNs, and even outperforms the state-of-the-art supervised-trained k-space deep learning methods in some cases.

CVDec 19, 2025Code
ClothHMR: 3D Mesh Recovery of Humans in Diverse Clothing from Single Image

Yunqi Gao, Leyuan Liu, Yuhan Li et al.

With 3D data rapidly emerging as an important form of multimedia information, 3D human mesh recovery technology has also advanced accordingly. However, current methods mainly focus on handling humans wearing tight clothing and perform poorly when estimating body shapes and poses under diverse clothing, especially loose garments. To this end, we make two key insights: (1) tailoring clothing to fit the human body can mitigate the adverse impact of clothing on 3D human mesh recovery, and (2) utilizing human visual information from large foundational models can enhance the generalization ability of the estimation. Based on these insights, we propose ClothHMR, to accurately recover 3D meshes of humans in diverse clothing. ClothHMR primarily consists of two modules: clothing tailoring (CT) and FHVM-based mesh recovering (MR). The CT module employs body semantic estimation and body edge prediction to tailor the clothing, ensuring it fits the body silhouette. The MR module optimizes the initial parameters of the 3D human mesh by continuously aligning the intermediate representations of the 3D mesh with those inferred from the foundational human visual model (FHVM). ClothHMR can accurately recover 3D meshes of humans wearing diverse clothing, precisely estimating their body shapes and poses. Experimental results demonstrate that ClothHMR significantly outperforms existing state-of-the-art methods across benchmark datasets and in-the-wild images. Additionally, a web application for online fashion and shopping powered by ClothHMR is developed, illustrating that ClothHMR can effectively serve real-world usage scenarios. The code and model for ClothHMR are available at: \url{https://github.com/starVisionTeam/ClothHMR}.

LGJul 26, 2025Code
FedSWA: Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging

Liu junkang, Yuanyuan Liu, Fanhua Shang et al.

For federated learning (FL) algorithms such as FedSAM, their generalization capability is crucial for real-word applications. In this paper, we revisit the generalization problem in FL and investigate the impact of data heterogeneity on FL generalization. We find that FedSAM usually performs worse than FedAvg in the case of highly heterogeneous data, and thus propose a novel and effective federated learning algorithm with Stochastic Weight Averaging (called \texttt{FedSWA}), which aims to find flatter minima in the setting of highly heterogeneous data. Moreover, we introduce a new momentum-based stochastic controlled weight averaging FL algorithm (\texttt{FedMoSWA}), which is designed to better align local and global models. Theoretically, we provide both convergence analysis and generalization bounds for \texttt{FedSWA} and \texttt{FedMoSWA}. We also prove that the optimization and generalization errors of \texttt{FedMoSWA} are smaller than those of their counterparts, including FedSAM and its variants. Empirically, experimental results on CIFAR10/100 and Tiny ImageNet demonstrate the superiority of the proposed algorithms compared to their counterparts. Open source code at: https://github.com/junkangLiu0/FedSWA.

MLOct 24, 2023
Bayesian imaging inverse problem with SA-Roundtrip prior via HMC-pCN sampler

Jiayu Qian, Yuanyuan Liu, Jingya Yang et al.

Bayesian inference with deep generative prior has received considerable interest for solving imaging inverse problems in many scientific and engineering fields. The selection of the prior distribution is learned from, and therefore an important representation learning of, available prior measurements. The SA-Roundtrip, a novel deep generative prior, is introduced to enable controlled sampling generation and identify the data's intrinsic dimension. This prior incorporates a self-attention structure within a bidirectional generative adversarial network. Subsequently, Bayesian inference is applied to the posterior distribution in the low-dimensional latent space using the Hamiltonian Monte Carlo with preconditioned Crank-Nicolson (HMC-pCN) algorithm, which is proven to be ergodic under specific conditions. Experiments conducted on computed tomography (CT) reconstruction with the MNIST and TomoPhantom datasets reveal that the proposed method outperforms state-of-the-art comparisons, consistently yielding a robust and superior point estimator along with precise uncertainty quantification.

CVDec 10, 2025
View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs

Yuanyuan Liu, Haiyang Mei, Dongyang Zhan et al.

3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified view renderings or video sequences with overlaid object markers. However, this VLM + SI paradigm yields entangled visual representations that compel the VLM to process entire cluttered cues, making it hard to exploit spatial semantic relationships effectively. In this work, we propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an active agent that selectively accesses necessary cues as it traverses the scene. This design offers two intrinsic advantages: (i) by structuring 3D context into a spatially and semantically coherent scene graph rather than confounding the VLM with densely entangled visual inputs, it lowers the VLM's reasoning difficulty; and (ii) by actively exploring and reasoning over the scene graph, it naturally produces transparent, step-by-step traces for interpretable 3DVG. Extensive experiments show that VoG achieves state-of-the-art zero-shot performance, establishing structured scene exploration as a promising strategy for advancing zero-shot 3DVG.

CVApr 26, 2024Code
Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting

Yuanyuan Liu, Yuxuan Huang, Shuyang Liu et al.

In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these models struggle with unknown classes common in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming to identify both known and new, unseen facial expressions. While existing approaches use large-scale vision-language models like CLIP to identify unseen classes, we argue that these methods may not adequately capture the subtle human expressions needed for OV-FER. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively. Our proposed HESP comprises three components: 1) a textual prompting module with learnable prompts to enhance CLIP's textual representation of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, and 3) an open-set multi-task learning scheme that promotes interaction between the textual and visual modules, improving the understanding of novel human emotions in video sequences. Extensive experiments conducted on four OV-FER task settings demonstrate that HESP can significantly boost CLIP's performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperform other state-of-the-art open-set video understanding methods by a large margin. Code is available at https://github.com/cosinehuang/HESP.

84.6AIApr 17
ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams

Qiang Xu, Shengyuan Bai, Yu Wang et al.

Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams. These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical task dimensions. Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception. These findings expose a fundamental deficit in structural understanding and establish directions for advancing visual reasoning.

CVSep 30, 2024
DIG-FACE: De-biased Learning for Generalized Facial Expression Category Discovery

Tingzhang Luo, Yichao Liu, Yuanyuan Liu et al.

We introduce a novel task, Generalized Facial Expression Category Discovery (G-FACE), that discovers new, unseen facial expressions while recognizing known categories effectively. Even though there are generalized category discovery methods for natural images, they show compromised performance on G-FACE. We identified two biases that affect the learning: implicit bias, coming from an underlying distributional gap between new categories in unlabeled data and known categories in labeled data, and explicit bias, coming from shifted preference on explicit visual facial change characteristics from known expressions to unknown expressions. By addressing the challenges caused by both biases, we propose a Debiased G-FACE method, namely DIG-FACE, that facilitates the debiasing of both implicit and explicit biases. In the implicit debiasing process of DIG-FACE, we devise a novel learning strategy that aims at estimating and minimizing the upper bound of implicit bias. In the explicit debiasing process, we optimize the model's ability to handle nuanced visual facial expression data by introducing a hierarchical category-discrimination refinement strategy: sample-level, triplet-level, and distribution-level optimizations. Extensive experiments demonstrate that our DIG-FACE significantly enhances recognition accuracy for both known and new categories, setting a first-of-its-kind standard for the task.

LGNov 14, 2025
When Genes Speak: A Semantic-Guided Framework for Spatially Resolved Transcriptomics Data Clustering

Jiangkai Long, Yanran Zhu, Chang Tang et al.

Spatial transcriptomics enables gene expression profiling with spatial context, offering unprecedented insights into the tissue microenvironment. However, most computational models treat genes as isolated numerical features, ignoring the rich biological semantics encoded in their symbols. This prevents a truly deep understanding of critical biological characteristics. To overcome this limitation, we present SemST, a semantic-guided deep learning framework for spatial transcriptomics data clustering. SemST leverages Large Language Models (LLMs) to enable genes to "speak" through their symbolic meanings, transforming gene sets within each tissue spot into biologically informed embeddings. These embeddings are then fused with the spatial neighborhood relationships captured by Graph Neural Networks (GNNs), achieving a coherent integration of biological function and spatial structure. We further introduce the Fine-grained Semantic Modulation (FSM) module to optimally exploit these biological priors. The FSM module learns spot-specific affine transformations that empower the semantic embeddings to perform an element-wise calibration of the spatial features, thus dynamically injecting high-order biological knowledge into the spatial context. Extensive experiments on public spatial transcriptomics datasets show that SemST achieves state-of-the-art clustering performance. Crucially, the FSM module exhibits plug-and-play versatility, consistently improving the performance when integrated into other baseline methods.

CVAug 24, 2020Code
A Single Frame and Multi-Frame Joint Network for 360-degree Panorama Video Super-Resolution

Hongying Liu, Zhubo Ruan, Chaowei Fang et al.

Spherical videos, also known as \ang{360} (panorama) videos, can be viewed with various virtual reality devices such as computers and head-mounted displays. They attract large amount of interest since awesome immersion can be experienced when watching spherical videos. However, capturing, storing and transmitting high-resolution spherical videos are extremely expensive. In this paper, we propose a novel single frame and multi-frame joint network (SMFN) for recovering high-resolution spherical videos from low-resolution inputs. To take advantage of pixel-level inter-frame consistency, deformable convolutions are used to eliminate the motion difference between feature maps of the target frame and its neighboring frames. A mixed attention mechanism is devised to enhance the feature representation capability. The dual learning strategy is exerted to constrain the space of solution so that a better solution can be found. A novel loss function based on the weighted mean square error is proposed to emphasize on the super-resolution of the equatorial regions. This is the first attempt to settle the super-resolution of spherical videos, and we collect a novel dataset from the Internet, MiG Panorama Video, which includes 204 videos. Experimental results on 4 representative video clips demonstrate the efficacy of the proposed method. The dataset and code are available at https://github.com/lovepiano/SMFN_For_360VSR.

CVNov 8, 2024
Smile upon the Face but Sadness in the Eyes: Emotion Recognition based on Facial Expressions and Eye Behaviors

Yuanyuan Liu, Lin Wei, Kejun Liu et al.

Emotion Recognition (ER) is the process of identifying human emotions from given data. Currently, the field heavily relies on facial expression recognition (FER) because facial expressions contain rich emotional cues. However, it is important to note that facial expressions may not always precisely reflect genuine emotions and FER-based results may yield misleading ER. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cues for the creation of a new Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. Different from existing multimodal ER datasets, the EMER dataset employs a stimulus material-induced spontaneous emotion generation method to integrate non-invasive eye behavior data, like eye movements and eye fixation maps, with facial videos, aiming to obtain natural and accurate human emotions. Notably, for the first time, we provide annotations for both ER and FER in the EMER, enabling a comprehensive analysis to better illustrate the gap between both tasks. Furthermore, we specifically design a new EMERT architecture to concurrently enhance performance in both ER and FER by efficiently identifying and bridging the emotion gap between the two.Specifically, our EMERT employs modality-adversarial feature decoupling and multi-task Transformer to augment the modeling of eye behaviors, thus providing an effective complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that the EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance.

48.1COMP-PHApr 7
A deep learning framework for jointly solving transient Fokker-Planck equations with arbitrary parameters and initial distributions

Xiaolong Wang, Jing Feng, Qi Liu et al.

Efficiently solving the Fokker-Planck equation (FPE) is central to analyzing complex parameterized stochastic systems. However, current numerical methods lack parallel computation capabilities across varying conditions, severely limiting comprehensive parameter exploration and transient analysis. This paper introduces a deep learning-based pseudo-analytical probability solution (PAPS) that, via a single training process, simultaneously resolves transient FPE solutions for arbitrary multi-modal initial distributions, system parameters, and time points. The core idea is to unify initial, transient, and stationary distributions via Gaussian mixture distributions (GMDs) and develop a constraint-preserving autoencoder that bijectively maps constrained GMD parameters to unconstrained, low-dimensional latent representations. In this representation space, the panoramic transient dynamics across varying initial conditions and system parameters can be modeled by a single evolution network. Extensive experiments on paradigmatic systems demonstrate that the proposed PAPS maintains high accuracy while achieving inference speeds four orders of magnitude faster than GPU-accelerated Monte Carlo simulations. This efficiency leap enables previously intractable real-time parameter sweeps and systematic investigations of stochastic bifurcations. By decoupling representation learning from physics-informed transient dynamics, our work establishes a scalable paradigm for probabilistic modeling of multi-dimensional, parameterized stochastic systems.

CVNov 21, 2025
Flow-Guided Implicit Neural Representation for Motion-Aware Dynamic MRI Reconstruction

Baoqing Li, Yuanyuan Liu, Congcong Liu et al.

Dynamic magnetic resonance imaging (dMRI) captures temporally-resolved anatomy but is often challenged by limited sampling and motion-induced artifacts. Conventional motion-compensated reconstructions typically rely on pre-estimated optical flow, which is inaccurate under undersampling and degrades reconstruction quality. In this work, we propose a novel implicit neural representation (INR) framework that jointly models both the dynamic image sequence and its underlying motion field. Specifically, one INR is employed to parameterize the spatiotemporal image content, while another INR represents the optical flow. The two are coupled via the optical flow equation, which serves as a physics-inspired regularization, in addition to a data consistency loss that enforces agreement with k-space measurements. This joint optimization enables simultaneous recovery of temporally coherent images and motion fields without requiring prior flow estimation. Experiments on dynamic cardiac MRI datasets demonstrate that the proposed method outperforms state-of-the-art motion-compensated and deep learning approaches, achieving superior reconstruction quality, accurate motion estimation, and improved temporal fidelity. These results highlight the potential of implicit joint modeling with flow-regularized constraints for advancing dMRI reconstruction.

CVOct 8, 2025
Self-supervised Deep Unrolled Model with Implicit Neural Representation Regularization for Accelerating MRI Reconstruction

Jingran Xu, Yuanyuan Liu, Yuanbiao Yang et al.

Magnetic resonance imaging (MRI) is a vital clinical diagnostic tool, yet its application is limited by prolonged scan times. Accelerating MRI reconstruction addresses this issue by reconstructing high-fidelity MR images from undersampled k-space measurements. In recent years, deep learning-based methods have demonstrated remarkable progress. However, most methods rely on supervised learning, which requires large amounts of fully-sampled training data that are difficult to obtain. This paper proposes a novel zero-shot self-supervised reconstruction method named UnrollINR, which enables scan-specific MRI reconstruction without external training data. UnrollINR adopts a physics-guided unrolled reconstruction architecture and introduces implicit neural representation (INR) as a regularization prior to effectively constrain the solution space. This method overcomes the local bias limitation of CNNs in traditional deep unrolled methods and avoids the instability associated with relying solely on INR's implicit regularization in highly ill-posed scenarios. Consequently, UnrollINR significantly improves MRI reconstruction performance under high acceleration rates. Experimental results show that even at a high acceleration rate of 10, UnrollINR achieves superior reconstruction performance compared to supervised and self-supervised learning methods, validating its effectiveness and superiority.

IVJun 12, 2025
DUN-SRE: Deep Unrolling Network with Spatiotemporal Rotation Equivariance for Dynamic MRI Reconstruction

Yuliang Zhu, Jing Cheng, Qi Xie et al.

Dynamic Magnetic Resonance Imaging (MRI) exhibits transformation symmetries, including spatial rotation symmetry within individual frames and temporal symmetry along the time dimension. Explicit incorporation of these symmetry priors in the reconstruction model can significantly improve image quality, especially under aggressive undersampling scenarios. Recently, Equivariant convolutional neural network (ECNN) has shown great promise in exploiting spatial symmetry priors. However, existing ECNNs critically fail to model temporal symmetry, arguably the most universal and informative structural prior in dynamic MRI reconstruction. To tackle this issue, we propose a novel Deep Unrolling Network with Spatiotemporal Rotation Equivariance (DUN-SRE) for Dynamic MRI Reconstruction. The DUN-SRE establishes spatiotemporal equivariance through a (2+1)D equivariant convolutional architecture. In particular, it integrates both the data consistency and proximal mapping module into a unified deep unrolling framework. This architecture ensures rigorous propagation of spatiotemporal rotation symmetry constraints throughout the reconstruction process, enabling more physically accurate modeling of cardiac motion dynamics in cine MRI. In addition, a high-fidelity group filter parameterization mechanism is developed to maintain representation precision while enforcing symmetry constraints. Comprehensive experiments on Cardiac CINE MRI datasets demonstrate that DUN-SRE achieves state-of-the-art performance, particularly in preserving rotation-symmetric structures, offering strong generalization capability to a broad range of dynamic MRI reconstruction tasks.

MMMay 4, 2023
Noise-Resistant Multimodal Transformer for Emotion Recognition

Yuanyuan Liu, Haoyu Zhang, Yibing Zhan et al.

Multimodal emotion recognition identifies human emotions from various data modalities like video, text, and audio. However, we found that this task can be easily affected by noisy information that does not contain useful semantics. To this end, we present a novel paradigm that attempts to extract noise-resistant features in its pipeline and introduces a noise-aware learning scheme to effectively improve the robustness of multimodal emotion understanding. Our new pipeline, namely Noise-Resistant Multimodal Transformer (NORM-TR), mainly introduces a Noise-Resistant Generic Feature (NRGF) extractor and a Transformer for the multimodal emotion recognition task. In particular, we make the NRGF extractor learn a generic and disturbance-insensitive representation so that consistent and meaningful semantics can be obtained. Furthermore, we apply a Transformer to incorporate Multimodal Features (MFs) of multimodal inputs based on their relations to the NRGF. Therefore, the possible insensitive but useful information of NRGF could be complemented by MFs that contain more details. To train the NORM-TR properly, our proposed noise-aware learning scheme complements normal emotion recognition losses by enhancing the learning against noises. Our learning scheme explicitly adds noises to either all the modalities or a specific modality at random locations of a multimodal input sequence. We correspondingly introduce two adversarial losses to encourage the NRGF extractor to learn to extract the NRGFs invariant to the added noises, thus facilitating the NORM-TR to achieve more favorable multimodal emotion recognition performance. In practice, on several popular multimodal datasets, our NORM-TR achieves state-of-the-art performance and outperforms existing methods by a large margin, which demonstrates that the ability to resist noisy information is important for effective emotion recognition.

IVMay 4, 2023
Meta-Learning Enabled Score-Based Generative Model for 1.5T-Like Image Reconstruction from 0.5T MRI

Zhuo-Xu Cui, Congcong Liu, Chentao Cao et al.

Magnetic resonance imaging (MRI) is known to have reduced signal-to-noise ratios (SNR) at lower field strengths, leading to signal degradation when producing a low-field MRI image from a high-field one. Therefore, reconstructing a high-field-like image from a low-field MRI is a complex problem due to the ill-posed nature of the task. Additionally, obtaining paired low-field and high-field MR images is often not practical. We theoretically uncovered that the combination of these challenges renders conventional deep learning methods that directly learn the mapping from a low-field MR image to a high-field MR image unsuitable. To overcome these challenges, we introduce a novel meta-learning approach that employs a teacher-student mechanism. Firstly, an optimal-transport-driven teacher learns the degradation process from high-field to low-field MR images and generates pseudo-paired high-field and low-field MRI images. Then, a score-based student solves the inverse problem of reconstructing a high-field-like MR image from a low-field MRI within the framework of iterative regularization, by learning the joint distribution of pseudo-paired images to act as a regularizer. Experimental results on real low-field MRI data demonstrate that our proposed method outperforms state-of-the-art unpaired learning methods.

IVJan 17, 2022
Dual Perceptual Loss for Single Image Super-Resolution Using ESRGAN

Jie Song, Huawei Yi, Wenqian Xu et al.

The proposal of perceptual loss solves the problem that per-pixel difference loss function causes the reconstructed image to be overly-smooth, which acquires a significant progress in the field of single image super-resolution reconstruction. Furthermore, the generative adversarial networks (GAN) is applied to the super-resolution field, which effectively improves the visual quality of the reconstructed image. However, under the condtion of high upscaling factors, the excessive abnormal reasoning of the network produces some distorted structures, so that there is a certain deviation between the reconstructed image and the ground-truth image. In order to fundamentally improve the quality of reconstructed images, this paper proposes a effective method called Dual Perceptual Loss (DP Loss), which is used to replace the original perceptual loss to solve the problem of single image super-resolution reconstruction. Due to the complementary property between the VGG features and the ResNet features, the proposed DP Loss considers the advantages of learning two features simultaneously, which significantly improves the reconstruction effect of images. The qualitative and quantitative analysis on benchmark datasets demonstrates the superiority of our proposed method over state-of-the-art super-resolution methods.

LGDec 18, 2021
Equilibrated Zeroth-Order Unrolled Deep Networks for Accelerated MRI

Zhuo-Xu Cui, Jing Cheng, Qingyong Zhu et al.

Recently, model-driven deep learning unrolls a certain iterative algorithm of a regularization model into a cascade network by replacing the first-order information (i.e., (sub)gradient or proximal operator) of the regularizer with a network module, which appears more explainable and predictable compared to common data-driven networks. Conversely, in theory, there is not necessarily such a functional regularizer whose first-order information matches the replaced network module, which means the network output may not be covered by the original regularization model. Moreover, up to now, there is also no theory to guarantee the global convergence and robustness (regularity) of unrolled networks under realistic assumptions. To bridge this gap, this paper propose to present a safeguarded methodology on network unrolling. Specifically, focusing on accelerated MRI, we unroll a zeroth-order algorithm, of which the network module represents the regularizer itself, so that the network output can be still covered by the regularization model. Furthermore, inspired by the ideal of deep equilibrium models, before backpropagating, we carry out the unrolled iterative network to converge to a fixed point to ensure the convergence. In case the measurement data contains noise, we prove that the proposed network is robust against noisy interference. Finally, numerical experiments show that the proposed network consistently outperforms the state-of-the-art MRI reconstruction methods including traditional regularization methods and other deep learning methods.

IVDec 1, 2021
Highly accelerated MR parametric mapping by undersampling the k-space and reducing the contrast number simultaneously with deep learning

Yanjie Zhu, Haoxiang Li, Yuanyuan Liu et al.

Purpose: To propose a novel deep learning-based method called RG-Net (reconstruction and generation network) for highly accelerated MR parametric mapping by undersampling k-space and reducing the acquired contrast number simultaneously. Methods: The proposed framework consists of a reconstruction module and a generative module. The reconstruction module reconstructs MR images from the acquired few undersampled k-space data with the help of a data prior. The generative module then synthesizes the remaining multi-contrast images from the reconstructed images, where the exponential model is implicitly incorporated into the image generation through the supervision of fully sampled labels. The RG-Net was evaluated on the T1\r{ho} mapping data of knee and brain at different acceleration rates. Regional T1\r{ho} analysis for cartilage and the brain was performed to access the performance of RG-Net. Results: RG-Net yields a high-quality T1\r{ho} map at a high acceleration rate of 17. Compared with the competing methods that only undersample k-space, our framework achieves better performance in T1\r{ho} value analysis. Our method also improves quality of T1\r{ho} maps on patient with glioma. Conclusion: The proposed RG-Net that adopted a new strategy by undersampling k-space and reducing the contrast number simultaneously for fast MR parametric mapping, can achieve a high acceleration rate while maintaining good reconstruction quality. The generative module of our framework can also be used as an insert module in other fast MR parametric mapping methods. Keywords: Deep learning, convolutional neural network, fast MR parametric mapping

CVSep 17, 2021
Expression Snippet Transformer for Robust Video-based Facial Expression Recognition

Yuanyuan Liu, Wenbin Wang, Chuanxu Feng et al.

The recent success of Transformer has provided a new direction to various visual understanding tasks, including video-based facial expression recognition (FER). By modeling visual relations effectively, Transformer has shown its power for describing complicated patterns. However, Transformer still performs unsatisfactorily to notice subtle facial expression movements, because the expression movements of many videos can be too small to extract meaningful spatial-temporal relations and achieve robust performance. To this end, we propose to decompose each video into a series of expression snippets, each of which contains a small number of facial movements, and attempt to augment the Transformer's ability for modeling intra-snippet and inter-snippet visual relations, respectively, obtaining the Expression snippet Transformer (EST). In particular, for intra-snippet modeling, we devise an attention-augmented snippet feature extractor (AA-SFE) to enhance the encoding of subtle facial movements of each snippet by gradually attending to more salient information. In addition, for inter-snippet modeling, we introduce a shuffled snippet order prediction (SSOP) head and a corresponding loss to improve the modeling of subtle motion changes across subsequent snippets by training the Transformer to identify shuffled snippet orders. Extensive experiments on four challenging datasets (i.e., BU-3DFE, MMI, AFEW, and DFEW) demonstrate that our EST is superior to other CNN-based methods, obtaining state-of-the-art performance.

ASAug 16, 2021
Language-Independent Approach for Automatic Computation of Vowel Articulation Features in Dysarthric Speech Assessment

Yuanyuan Liu, Nelly Penttilä, Tiina Ihalainen et al.

Imprecise vowel articulation can be observed in people with Parkinson's disease (PD). Acoustic features measuring vowel articulation have been demonstrated to be effective indicators of PD in its assessment. Standard clinical vowel articulation features of vowel working space area (VSA), vowel articulation index (VAI) and formants centralization ratio (FCR), are derived the first two formants of the three corner vowels /a/, /i/ and /u/. Conventionally, manual annotation of the corner vowels from speech data is required before measuring vowel articulation. This process is time-consuming. The present work aims to reduce human effort in clinical analysis of PD speech by proposing an automatic pipeline for vowel articulation assessment. The method is based on automatic corner vowel detection using a language universal phoneme recognizer, followed by statistical analysis of the formant data. The approach removes the restrictions of prior knowledge of speaking content and the language in question. Experimental results on a Finnish PD speech corpus demonstrate the efficacy and reliability of the proposed automatic method in deriving VAI, VSA, FCR and F2i/F2u (the second formant ratio for vowels /i/ and /u/). The automatically computed parameters are shown to be highly correlated with features computed with manual annotations of corner vowels. In addition, automatically and manually computed vowel articulation features have comparable correlations with experts' ratings on speech intelligibility, voice impairment and overall severity of communication disorder. Language-independence of the proposed approach is further validated on a Spanish PD database, PC-GITA, as well as on TORGO corpus of English dysarthric speech.

LGJun 23, 2021
Behavior Mimics Distribution: Combining Individual and Group Behaviors for Federated Learning

Hua Huang, Fanhua Shang, Yuanyuan Liu et al.

Federated Learning (FL) has become an active and promising distributed machine learning paradigm. As a result of statistical heterogeneity, recent studies clearly show that the performance of popular FL methods (e.g., FedAvg) deteriorates dramatically due to the client drift caused by local updates. This paper proposes a novel Federated Learning algorithm (called IGFL), which leverages both Individual and Group behaviors to mimic distribution, thereby improving the ability to deal with heterogeneity. Unlike existing FL methods, our IGFL can be applied to both client and server optimization. As a by-product, we propose a new attention-based federated learning in the server optimization of IGFL. To the best of our knowledge, this is the first time to incorporate attention mechanisms into federated optimization. We conduct extensive experiments and show that IGFL can significantly improve the performance of existing federated learning methods. Especially when the distributions of data among individuals are diverse, IGFL can improve the classification accuracy by about 13% compared with prior baselines.

LGJun 22, 2021
Learned Interpretable Residual Extragradient ISTA for Sparse Coding

Lin Kong, Wei Sun, Fanhua Shang et al.

Recently, the study on learned iterative shrinkage thresholding algorithm (LISTA) has attracted increasing attentions. A large number of experiments as well as some theories have proved the high efficiency of LISTA for solving sparse coding problems. However, existing LISTA methods are all serial connection. To address this issue, we propose a novel extragradient based LISTA (ELISTA), which has a residual structure and theoretical guarantees. In particular, our algorithm can also provide the interpretability for Res-Net to a certain extent. From a theoretical perspective, we prove that our method attains linear convergence. In practice, extensive empirical results verify the advantages of our method.

CVMar 22, 2021
Large Motion Video Super-Resolution with Dual Subnet and Multi-Stage Communicated Upsampling

Hongying Liu, Peng Zhao, Zhubo Ruan et al.

Video super-resolution (VSR) aims at restoring a video in low-resolution (LR) and improving it to higher-resolution (HR). Due to the characteristics of video tasks, it is very important that motion information among frames should be well concerned, summarized and utilized for guidance in a VSR algorithm. Especially, when a video contains large motion, conventional methods easily bring incoherent results or artifacts. In this paper, we propose a novel deep neural network with Dual Subnet and Multi-stage Communicated Upsampling (DSMC) for super-resolution of videos with large motion. We design a new module named U-shaped residual dense network with 3D convolution (U3D-RDN) for fine implicit motion estimation and motion compensation (MEMC) as well as coarse spatial feature extraction. And we present a new Multi-Stage Communicated Upsampling (MSCU) module to make full use of the intermediate results of upsampling for guiding the VSR. Moreover, a novel dual subnet is devised to aid the training of our DSMC, whose dual loss helps to reduce the solution space as well as enhance the generalization ability. Our experimental results confirm that our method achieves superior performance on videos with large motion compared to state-of-the-art methods.

LGOct 31, 2020
Differentially Private ADMM Algorithms for Machine Learning

Tao Xu, Fanhua Shang, Yuanyuan Liu et al.

In this paper, we study efficient differentially private alternating direction methods of multipliers (ADMM) via gradient perturbation for many machine learning problems. For smooth convex loss functions with (non)-smooth regularization, we propose the first differentially private ADMM (DP-ADMM) algorithm with performance guarantee of $(ε,δ)$-differential privacy ($(ε,δ)$-DP). From the viewpoint of theoretical analysis, we use the Gaussian mechanism and the conversion relationship between Rényi Differential Privacy (RDP) and DP to perform a comprehensive privacy analysis for our algorithm. Then we establish a new criterion to prove the convergence of the proposed algorithms including DP-ADMM. We also give the utility analysis of our DP-ADMM. Moreover, we propose an accelerated DP-ADMM (DP-AccADMM) with the Nesterov's acceleration technique. Finally, we conduct numerical experiments on many real-world datasets to show the privacy-utility tradeoff of the two proposed algorithms, and all the comparative analysis shows that DP-AccADMM converges faster and has a better utility than DP-ADMM, when the privacy budget $ε$ is larger than a threshold.

LGOct 21, 2020
Boosting Gradient for White-Box Adversarial Attacks

Hongying Liu, Zhenyu Zhou, Fanhua Shang et al.

Deep neural networks (DNNs) are playing key roles in various artificial intelligence applications such as image classification and object recognition. However, a growing number of studies have shown that there exist adversarial examples in DNNs, which are almost imperceptibly different from original samples, but can greatly change the network output. Existing white-box attack algorithms can generate powerful adversarial examples. Nevertheless, most of the algorithms concentrate on how to iteratively make the best use of gradients to improve adversarial performance. In contrast, in this paper, we focus on the properties of the widely-used ReLU activation function, and discover that there exist two phenomena (i.e., wrong blocking and over transmission) misleading the calculation of gradients in ReLU during the backpropagation. Both issues enlarge the difference between the predicted changes of the loss function from gradient and corresponding actual changes, and mislead the gradients which results in larger perturbations. Therefore, we propose a universal adversarial example generation method, called ADV-ReLU, to enhance the performance of gradient based white-box attack algorithms. During the backpropagation of the network, our approach calculates the gradient of the loss function versus network input, maps the values to scores, and selects a part of them to update the misleading gradients. Comprehensive experimental results on \emph{ImageNet} demonstrate that our ADV-ReLU can be easily integrated into many state-of-the-art gradient-based white-box attack algorithms, as well as transferred to black-box attack attackers, to further decrease perturbations in the ${\ell _2}$-norm.

CVJul 25, 2020
Video Super Resolution Based on Deep Learning: A Comprehensive Survey

Hongying Liu, Zhubo Ruan, Peng Zhao et al.

In recent years, deep learning has made great progress in many fields such as image recognition, natural language processing, speech recognition and video super-resolution. In this survey, we comprehensively investigate 33 state-of-the-art video super-resolution (VSR) methods based on deep learning. It is well known that the leverage of information within video frames is important for video super-resolution. Thus we propose a taxonomy and classify the methods into six sub-categories according to the ways of utilizing inter-frame information. Moreover, the architectures and implementation details of all the methods are depicted in detail. Finally, we summarize and compare the performance of the representative VSR method on some benchmark datasets. We also discuss some challenges, which need to be further addressed by researchers in the community of VSR. To the best of our knowledge, this work is the first systematic review on VSR tasks, and it is expected to make a contribution to the development of recent studies in this area and potentially deepen our understanding to the VSR techniques based on deep learning.

LGDec 2, 2019
Efficient Relaxed Gradient Support Pursuit for Sparsity Constrained Non-convex Optimization

Fanhua Shang, Bingkun Wei, Hongying Liu et al.

Large-scale non-convex sparsity-constrained problems have recently gained extensive attention. Most existing deterministic optimization methods (e.g., GraSP) are not suitable for large-scale and high-dimensional problems, and thus stochastic optimization methods with hard thresholding (e.g., SVRGHT) become more attractive. Inspired by GraSP, this paper proposes a new general relaxed gradient support pursuit (RGraSP) framework, in which the sub-algorithm only requires to satisfy a slack descent condition. We also design two specific semi-stochastic gradient hard thresholding algorithms. In particular, our algorithms have much less hard thresholding operations than SVRGHT, and their average per-iteration cost is much lower (i.e., O(d) vs. O(d log(d)) for SVRGHT), which leads to faster convergence. Our experimental results on both synthetic and real-world datasets show that our algorithms are superior to the state-of-the-art gradient hard thresholding methods.

CVMay 24, 2019
Pose-adaptive Hierarchical Attention Network for Facial Expression Recognition

Yuanyuan Liu, Jiyao Peng, Jiabei Zeng et al.

Multi-view facial expression recognition (FER) is a challenging task because the appearance of an expression varies in poses. To alleviate the influences of poses, recent methods either perform pose normalization or learn separate FER classifiers for each pose. However, these methods usually have two stages and rely on good performance of pose estimators. Different from existing methods, we propose a pose-adaptive hierarchical attention network (PhaNet) that can jointly recognize the facial expressions and poses in unconstrained environment. Specifically, PhaNet discovers the most relevant regions to the facial expression by an attention mechanism in hierarchical scales, and the most informative scales are then selected to learn the pose-invariant and expression-discriminative representations. PhaNet is end-to-end trainable by minimizing the hierarchical attention losses, the FER loss and pose loss with dynamically learned loss weights. We validate the effectiveness of the proposed PhaNet on three multi-view datasets (BU-3DFE, Multi-pie, and KDEF) and two in-the-wild FER datasets (AffectNet and SFEW). Extensive experiments demonstrate that our framework outperforms the state-of-the-arts under both within-dataset and cross-dataset settings, achieving the average accuracies of 84.92\%, 93.53\%, 88.5\%, 54.82\% and 31.25\% respectively.

MLDec 4, 2018
Parallel-tempered Stochastic Gradient Hamiltonian Monte Carlo for Approximate Multimodal Posterior Sampling

Rui Luo, Qiang Zhang, Yuanyuan Liu

We propose a new sampler that integrates the protocol of parallel tempering with the Nosé-Hoover (NH) dynamics. The proposed method can efficiently draw representative samples from complex posterior distributions with multiple isolated modes in the presence of noise arising from stochastic gradient. It potentially facilitates deep Bayesian learning on large datasets where complex multimodal posteriors and mini-batch gradient are encountered.

LGNov 8, 2018
Benchmarking Deep Sequential Models on Volatility Predictions for Financial Time Series

Qiang Zhang, Rui Luo, Yaodong Yang et al.

Volatility is a quantity of measurement for the price movements of stocks or options which indicates the uncertainty within financial markets. As an indicator of the level of risk or the degree of variation, volatility is important to analyse the financial market, and it is taken into consideration in various decision-making processes in financial activities. On the other hand, recent advancement in deep learning techniques has shown strong capabilities in modelling sequential data, such as speech and natural language. In this paper, we empirically study the applicability of the latest deep structures with respect to the volatility modelling problem, through which we aim to provide an empirical guidance for the theoretical analysis of the marriage between deep learning techniques and financial applications in the future. We examine both the traditional approaches and the deep sequential models on the task of volatility prediction, including the most recent variants of convolutional and recurrent networks, such as the dilated architecture. Accordingly, experiments with real-world stock price datasets are performed on a set of 1314 daily stock series for 2018 days of transaction. The evaluation and comparison are based on the negative log likelihood (NLL) of real-world stock price time series. The result shows that the dilated neural models, including dilated CNN and Dilated RNN, produce most accurate estimation and prediction, outperforming various widely-used deterministic models in the GARCH family and several recently proposed stochastic models. In addition, the high flexibility and rich expressive power are validated in this study.

LGOct 11, 2018
Bilinear Factor Matrix Norm Minimization for Robust PCA: Algorithms and Applications

Fanhua Shang, James Cheng, Yuanyuan Liu et al.

The heavy-tailed distributions of corrupted outliers and singular values of all channels in low-level vision have proven effective priors for many applications such as background modeling, photometric stereo and image alignment. And they can be well modeled by a hyper-Laplacian. However, the use of such distributions generally leads to challenging non-convex, non-smooth and non-Lipschitz problems, and makes existing algorithms very slow for large-scale applications. Together with the analytic solutions to lp-norm minimization with two specific values of p, i.e., p=1/2 and p=2/3, we propose two novel bilinear factor matrix norm minimization models for robust principal component analysis. We first define the double nuclear norm and Frobenius/nuclear hybrid norm penalties, and then prove that they are in essence the Schatten-1/2 and 2/3 quasi-norms, respectively, which lead to much more tractable and scalable Lipschitz optimization problems. Our experimental analysis shows that both our methods yield more accurate solutions than original Schatten quasi-norm minimization, even when the number of observations is very limited. Finally, we apply our penalties to various low-level vision problems, e.g., text removal, moving object detection, image alignment and inpainting, and show that our methods usually outperform the state-of-the-art methods.