Yonggui Zhu

CV
h-index: 2
Papers: 7
Citations: 16
Novelty: 40%
AI Score: 28

7 Papers

CV · Sep 14, 2022 · Code
Considering Image Information and Self-similarity: A Compositional Denoising Network

Jiahong Zhang, Yonggui Zhu, Wenshu Yu et al.

Recently, convolutional neural networks (CNNs) have been widely used in image denoising. Existing methods benefit from residual learning and achieve high performance. Much research has focused on optimizing the CNN architecture while ignoring the limitations of residual learning. This paper identifies two such limitations. One is that residual learning focuses on estimating noise, thus overlooking the image information. The other is that image self-similarity is not effectively considered. This paper proposes a compositional denoising network (CDN), whose image information path (IIP) and noise estimation path (NEP) address these two problems, respectively. IIP is trained in an image-to-image manner to extract image information. NEP exploits image self-similarity during training: a similarity-based training method constrains NEP to output a similar estimated noise distribution for different image patches corrupted by a specific kind of noise. Finally, image information and noise distribution information are jointly considered for image denoising. Experiments show that CDN achieves state-of-the-art results in synthetic and real-world image denoising. Our code will be released at https://github.com/JiaHongZ/CDN.

CV · Sep 25, 2023
A Lightweight Recurrent Grouping Attention Network for Video Super-Resolution

Yonggui Zhu, Guofang Li

Effective aggregation of temporal information across consecutive frames is the core of video super-resolution (VSR). Many researchers have used sliding-window and recurrent structures to gather spatio-temporal information from frames. However, although the performance of VSR models keeps improving, model size is also increasing, which raises hardware requirements. To reduce the load on the device, we propose a novel lightweight recurrent grouping attention network. The model has only 0.878M parameters, far fewer than current mainstream video super-resolution models. We design a forward feature extraction module and a backward feature extraction module to collect temporal information between consecutive frames from two directions. Moreover, a new grouping mechanism is proposed to efficiently collect spatio-temporal information of the reference frame and its neighboring frames. An attention supplementation module is presented to further enlarge the information gathering range of the model. The feature reconstruction module aggregates information from different directions to reconstruct high-resolution features. Experiments demonstrate that our model achieves state-of-the-art performance on multiple datasets.

NA · Feb 4, 2018
Spherical function regularization for parallel MRI reconstruction

Yonggui Zhu, Tuomo Valkonen

From the optimization point of view, a difficulty with parallel MRI with simultaneous coil sensitivity estimation is the multiplicative nature of the non-linear forward operator: the image being reconstructed and the coil sensitivities compete against each other, causing the optimization process to be very sensitive to small perturbations. This can, to some extent, be avoided by regularizing the unknown in a suitably "orthogonal" fashion. In this paper, we introduce such a regularization based on spherical function bases. To perform this regularization, we present efficient recurrence formulas for spherical Bessel functions and associated Legendre functions. Numerically, we study the solution of the model with non-linear ADMM. We perform various numerical simulations to demonstrate the efficacy of the proposed model in parallel MRI reconstruction.
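As a rough illustration of what a recurrence formula for spherical Bessel functions looks like, the sketch below uses the textbook three-term relation j_{n+1}(x) = ((2n+1)/x) j_n(x) - j_{n-1}(x), started from the closed forms of j_0 and j_1. This is an assumption made for illustration only: the abstract does not state which recurrences the paper uses, and the function name `spherical_bessel_j` is hypothetical.

```python
import math


def spherical_bessel_j(n_max, x):
    """Return [j_0(x), ..., j_{n_max}(x)] via the upward three-term recurrence.

    Illustrative only (not the paper's formulas). Upward recurrence loses
    accuracy when the order n is much larger than x.
    """
    if x == 0.0:
        # j_0(0) = 1 and j_n(0) = 0 for n >= 1.
        return [1.0] + [0.0] * n_max
    values = [math.sin(x) / x]  # j_0
    if n_max >= 1:
        values.append(math.sin(x) / x**2 - math.cos(x) / x)  # j_1
    for n in range(1, n_max):
        # j_{n+1} = (2n + 1) / x * j_n - j_{n-1}
        values.append((2 * n + 1) / x * values[n] - values[n - 1])
    return values
```

For orders well above x, a downward (Miller-type) recurrence is usually the more stable choice.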

IV · Mar 5, 2022
A Novel Dual Dense Connection Network for Video Super-resolution

Guofang Li, Yonggui Zhu

Video super-resolution (VSR) refers to the reconstruction of high-resolution (HR) video from the corresponding low-resolution (LR) video. Recently, VSR has received increasing attention. In this paper, we propose a novel dual dense connection network that can generate high-quality super-resolution (SR) results. The input frames are divided into a reference frame, a pre-temporal group, and a post-temporal group, representing information from different time periods. This grouping provides accurate information for each time period without disturbing the temporal order. We also introduce a new loss function that improves the convergence of the model. Experiments show that our model outperforms other advanced models on the Vid4 and SPMCS-11 datasets.
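The frame grouping described in the abstract can be pictured with a minimal sketch. It assumes the reference frame is the center of an odd-length input window, which the abstract does not state; the function name `split_temporal_groups` is hypothetical.

```python
def split_temporal_groups(frames):
    """Split a window of frames into (pre-temporal group, reference, post-temporal group).

    Assumption for illustration: the reference frame is the middle frame of
    an odd-length window. Frame order inside each group is preserved.
    """
    if len(frames) % 2 == 0:
        raise ValueError("expected an odd number of frames")
    mid = len(frames) // 2
    pre_group = frames[:mid]        # frames before the reference
    reference = frames[mid]         # center frame
    post_group = frames[mid + 1:]   # frames after the reference
    return pre_group, reference, post_group
```

With five frames `["f0", "f1", "f2", "f3", "f4"]`, the reference is `"f2"` and each group holds two frames in their original order.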

CV · Feb 15, 2025 · Code
VarGes: Improving Variation in Co-Speech 3D Gesture Generation via StyleCLIPS

Ming Meng, Ke Mu, Yonggui Zhu et al.

Generating expressive and diverse human gestures from audio is crucial in fields like human-computer interaction, virtual reality, and animation. Though existing methods have achieved remarkable performance, they often exhibit limitations due to constrained dataset diversity and the restricted amount of information derived from audio inputs. To address these challenges, we present VarGes, a novel variation-driven framework designed to enhance co-speech gesture generation by integrating visual stylistic cues while maintaining naturalness. Our approach begins with the Variation-Enhanced Feature Extraction (VEFE) module, which seamlessly incorporates style-reference video data into a 3D human pose estimation network to extract StyleCLIPS, thereby enriching the input with stylistic information. Subsequently, we employ the Variation-Compensation Style Encoder (VCSE), a transformer-style encoder equipped with an additive attention mechanism pooling layer, to robustly encode diverse StyleCLIPS representations and effectively manage stylistic variations. Finally, the Variation-Driven Gesture Predictor (VDGP) module fuses MFCC audio features with StyleCLIPS encodings via cross-attention, injecting this fused data into a cross-conditional autoregressive model to modulate 3D human gesture generation based on audio input and stylistic cues. The efficacy of our approach is validated on benchmark datasets, where it outperforms existing methods in terms of gesture diversity and naturalness. The code and video results will be made publicly available upon acceptance: https://github.com/mookerr/VarGES/

CV · Jun 15, 2024
A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing

Ming Meng, Yufei Zhao, Bo Zhang et al.

Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality and game production. Recently, significant breakthroughs have been made with the introduction of novel models such as the transformer and the diffusion model. Current methods can not only generate new content but also edit the generated material. This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driving mechanisms, and editing techniques. We summarize milestone studies and critically analyze their innovations and shortcomings within each domain. Additionally, we organize an extensive collection of datasets and provide a thorough performance analysis of current methodologies based on various evaluation metrics, aiming to furnish a clear framework and robust data support for future research. Finally, we explore application scenarios of talking head synthesis, illustrate them with specific cases, and examine potential future directions.

CV · Jan 15, 2019
Resampling detection of recompressed images via dual-stream convolutional neural network

Gang Cao, Antao Zhou, Xianglin Huang et al.

Resampling detection plays an important role in identifying image tampering, such as image splicing. Currently, resampling detection remains difficult in recompressed images, which are produced by applying resampling followed by post-JPEG compression to primary JPEG images. Except for the scenario of low-quality primary compression, it remains rather challenging due to the widespread use of middle/high-quality compression in imaging devices. In this paper, we propose a new convolutional neural network (CNN) method to learn the resampling trace features directly from the recompressed images. To this end, a noise extraction layer based on low-order high-pass filters is deployed to yield the image residual domain, which is better suited to extracting manipulation trace features. A dual-stream CNN is presented to capture the resampling traces along different directions, where the horizontal and vertical streams are interleaved and concatenated. Lastly, the learned features are fed into a Sigmoid/Softmax layer, which acts as a binary/multi-class classifier for blind detection and parameter estimation of resampling, respectively. Extensive experimental results demonstrate that our proposed method detects resampling effectively in recompressed images and outperforms state-of-the-art detectors.
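To make the idea of a residual domain concrete, here is a minimal sketch that applies first-order difference filters along the horizontal and vertical directions of a grayscale image stored as a list of rows. The choice of first-order differences is an assumption: the abstract only says "low-order high-pass filters" and does not give the kernels. The function name `first_order_residuals` is hypothetical.

```python
def first_order_residuals(image):
    """Return (horizontal, vertical) first-order difference residuals.

    `image` is a non-empty list of equal-length rows of numbers.
    Illustrative stand-in for a high-pass noise extraction step; the
    paper's actual filters are not specified in the abstract.
    """
    # Horizontal residual: difference between each pixel and its right neighbor.
    horizontal = [
        [row[c + 1] - row[c] for c in range(len(row) - 1)]
        for row in image
    ]
    # Vertical residual: difference between each pixel and the one below it.
    vertical = [
        [image[r + 1][c] - image[r][c] for c in range(len(image[0]))]
        for r in range(len(image) - 1)
    ]
    return horizontal, vertical
```

Differencing suppresses smooth image content and keeps local variations, which is where the periodic artifacts left by interpolation tend to show up.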