ZiYue Zhao

AS
h-index: 2
7 papers
96 citations
Novelty: 51%
AI Score: 36

7 Papers

LG · Oct 17, 2022
Principled Pruning of Bayesian Neural Networks through Variational Free Energy Minimization

Jim Beckers, Bart van Erp, Ziyue Zhao et al.

Bayesian model reduction provides an efficient approach for comparing the performance of all nested sub-models of a model, without re-evaluating any of these sub-models. Until now, Bayesian model reduction has been applied mainly in the computational neuroscience community to simple models. In this paper, we formulate and apply Bayesian model reduction to perform principled pruning of Bayesian neural networks, based on variational free energy minimization. Direct application of Bayesian model reduction, however, gives rise to approximation errors. Therefore, a novel iterative pruning algorithm is presented to alleviate the problems arising with naive Bayesian model reduction, as supported experimentally on the publicly available UCI datasets for different inference algorithms. This novel parameter pruning scheme addresses the shortcomings of current state-of-the-art pruning methods that are used by the signal processing community. The proposed approach has a clear stopping criterion and minimizes the same objective that is used during training. In addition to these benefits, our experiments indicate better model performance in comparison to state-of-the-art pruning schemes.
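To make the idea of scoring a nested sub-model without refitting concrete, here is a minimal sketch of Bayesian model reduction for a single scalar Gaussian parameter. It uses the standard closed-form result for Gaussian priors and posteriors, not the paper's iterative algorithm. The function name `bmr_gaussian`, its argument names, and the numbers in the usage lines are illustrative assumptions. The returned value is the change in log evidence (reduced minus full), so a positive value favours the reduced prior; in free-energy terms the sign is flipped.

```python
import math

def bmr_gaussian(post_mean, post_prec, prior_mean, prior_prec, red_mean, red_prec):
    """Score a reduced Gaussian prior against the full prior without refitting.

    All arguments are scalars; *_prec are precisions (inverse variances).
    Returns (delta_log_evidence, reduced_post_mean, reduced_post_prec).
    This is a hypothetical helper for illustration only.
    """
    # Reduced posterior precision: swap the full prior's precision for the reduced one.
    new_prec = post_prec + red_prec - prior_prec
    if new_prec <= 0:
        raise ValueError("reduced posterior precision must be positive")
    # Reduced posterior mean from the precision-weighted means.
    new_mean = (post_prec * post_mean + red_prec * red_mean
                - prior_prec * prior_mean) / new_prec
    # Log of the expected prior ratio under the full posterior.
    log_det_term = 0.5 * math.log(red_prec * post_prec / (prior_prec * new_prec))
    quad_term = 0.5 * (post_prec * post_mean ** 2 + red_prec * red_mean ** 2
                       - prior_prec * prior_mean ** 2 - new_prec * new_mean ** 2)
    return log_det_term - quad_term, new_mean, new_prec

# A weight the data barely informed: shrinking its prior towards zero is favoured.
delta_weak, _, _ = bmr_gaussian(0.01, 1.5, 0.0, 1.0, 0.0, 100.0)
# A weight the data pinned far from zero: the same shrinkage is strongly penalised.
delta_strong, _, _ = bmr_gaussian(2.0, 50.0, 0.0, 1.0, 0.0, 100.0)
```

In a pruning loop of the kind the abstract describes, a weight would be a candidate for removal only when its score favours the shrunken prior; the stopping rule and the handling of approximation errors are specific to the paper and are not reproduced here.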

AS · Aug 14, 2019 · Code
Components Loss for Neural Networks in Mask-Based Speech Enhancement

Ziyi Xu, Samy Elshamy, Ziyue Zhao et al.

Estimating time-frequency domain masks for single-channel speech enhancement using deep learning methods has recently become a popular research field with promising results. In this paper, we propose a novel components loss (CL) for the training of neural networks for mask-based speech enhancement. During the training process, the proposed CL offers separate control over preservation of the speech component quality, suppression of the residual noise component, and preservation of a naturally sounding residual noise component. We illustrate the potential of the proposed CL by evaluating a standard convolutional neural network (CNN) for mask-based speech enhancement. The new CL obtains a better and more balanced performance in almost all employed instrumental quality metrics over the baseline losses, the latter comprising the conventional mean squared error (MSE) loss and also auditory-related loss functions, such as the perceptual evaluation of speech quality (PESQ) loss and the recently proposed perceptual weighting filter loss. In particular, applying the CL offers better speech component quality, better overall enhanced speech perceptual quality, as well as a more naturally sounding residual noise. On average, for seen noise types, the PESQ score on the enhanced speech is at least 0.1 points higher, while the SNR improvement is also higher by more than 0.5 dB. The improvement is stronger for unseen noise types, where the PESQ score on the enhanced speech is about 0.2 points higher, while the output SNR is also ahead by more than 0.5 dB. The proposed CL is easy to implement, and code is provided at https://github.com/ifnspaml/Components-Loss.
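The notion of separate control over the speech and residual noise components can be sketched as follows. This is an assumption-laden illustration, not the loss from the paper or its repository. It applies the estimated mask to the clean speech and to the noise separately, then combines a speech-distortion term with a residual-noise-power term through a hypothetical trade-off weight `alpha`. The third term on the naturalness of the residual noise is omitted because the abstract does not give its form.

```python
def components_loss(mask, speech, noise, alpha=0.5):
    """Two-term, components-style loss for one frame of spectral amplitudes.

    mask, speech, noise: equal-length lists of floats (hypothetical inputs).
    alpha: illustrative trade-off weight between the two terms.
    Returns (total, speech_term, noise_term).
    """
    if not (len(mask) == len(speech) == len(noise)):
        raise ValueError("mask, speech and noise must have the same length")
    n = len(mask)
    # Apply the same mask to each component of the noisy mixture separately.
    filtered_speech = [m * s for m, s in zip(mask, speech)]
    filtered_noise = [m * x for m, x in zip(mask, noise)]
    # Speech term: distortion the mask introduces into the speech component.
    speech_term = sum((fs - s) ** 2 for fs, s in zip(filtered_speech, speech)) / n
    # Noise term: power of the noise that survives the mask.
    noise_term = sum(fn ** 2 for fn in filtered_noise) / n
    total = (1.0 - alpha) * speech_term + alpha * noise_term
    return total, speech_term, noise_term

# An all-pass mask leaves speech undistorted but keeps all of the noise.
pass_total, pass_speech, pass_noise = components_loss([1.0, 1.0], [1.0, 2.0], [3.0, 4.0])
# An all-zero mask removes the noise but also removes the speech.
zero_total, zero_speech, zero_noise = components_loss([0.0, 0.0], [1.0, 2.0], [3.0, 4.0])
```

The two extreme masks show why the terms pull in opposite directions: each drives one term to zero while maximising the other, so the weight decides where training settles between them.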

CV · Jul 26, 2025
FaRMamba: Frequency-based learning and Reconstruction aided Mamba for Medical Segmentation

Ze Rong, ZiYue Zhao, Zhaoxin Wang et al.

Accurate medical image segmentation remains challenging due to blurred lesion boundaries (LBA), loss of high-frequency details (LHD), and difficulty in modeling long-range anatomical structures (DC-LRSS). Vision Mamba employs one-dimensional causal state-space recurrence to efficiently model global dependencies, thereby substantially mitigating DC-LRSS. However, its patch tokenization and 1D serialization disrupt local pixel adjacency and impose a low-pass filtering effect, resulting in Local High-frequency Information Capture Deficiency (LHICD) and two-dimensional Spatial Structure Degradation (2D-SSD), which in turn exacerbate LBA and LHD. In this work, we propose FaRMamba, a novel extension that explicitly addresses LHICD and 2D-SSD through two complementary modules. A Multi-Scale Frequency Transform Module (MSFM) restores attenuated high-frequency cues by isolating and reconstructing multi-band spectra via wavelet, cosine, and Fourier transforms. A Self-Supervised Reconstruction Auxiliary Encoder (SSRAE) enforces pixel-level reconstruction on the shared Mamba encoder to recover full 2D spatial correlations, enhancing both fine textures and global context. Extensive evaluations on CAMUS echocardiography, MRI-based Mouse-cochlea, and Kvasir-Seg endoscopy demonstrate that FaRMamba consistently outperforms competitive CNN-Transformer hybrids and existing Mamba variants, delivering superior boundary accuracy, detail preservation, and global coherence without prohibitive computational overhead. This work provides a flexible frequency-aware framework for future segmentation models that directly mitigates core challenges in medical imaging.

LG · Apr 1, 2025
Large EEG-U-Transformer for Time-Step Level Detection Without Pre-Training

Kerui Wu, Ziyue Zhao, Bülent Yener

Electroencephalography (EEG) reflects the brain's functional state, making it a crucial tool for diverse detection applications like seizure detection and sleep stage classification. While deep learning-based approaches have recently shown promise for automated detection, traditional models are often constrained by limited learnable parameters and only achieve modest performance. In contrast, large foundation models have shown improved capabilities by scaling up the model size, but require extensive, time-consuming pre-training. Moreover, both types of existing methods require complex and redundant post-processing pipelines to convert discrete labels to continuous annotations. In this work, based on the multi-scale nature of EEG events, we propose a simple U-shaped model to efficiently learn representations by capturing both local and global features using convolution and self-attentive modules for sequence-to-sequence modeling. Compared to other window-level classification models, our method directly outputs predictions at the time-step level, eliminating redundant overlapping inferences. Beyond sequence-to-sequence modeling, the architecture naturally extends to window-level classification by incorporating an attention-pooling layer. In our experiments, this paradigm shift and model design demonstrated promising efficiency improvements, cross-subject generalization, and state-of-the-art performance in various time-step and window-level classification tasks. More impressively, our model can be scaled up to the same size as existing large foundation models that have been extensively pre-trained over diverse datasets, and it outperforms them using only the downstream fine-tuning dataset. Our model won 1st place in the 2025 "seizure detection challenge" organized at the International Conference on Artificial Intelligence in Epilepsy and Other Neurological Disorders.
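The step from time-step outputs to a single window-level vector via attention pooling can be illustrated with a generic sketch. The paper's exact layer is not described in the abstract, so the function `attention_pool` and its inputs are assumptions. In a trained model the scores would come from a learned projection of the features; here they are passed in directly.

```python
import math

def attention_pool(features, scores):
    """Collapse per-time-step features into one vector with softmax weights.

    features: list of T feature vectors (lists of floats), one per time step.
    scores: list of T unnormalised attention scores (hypothetical inputs).
    Returns a single feature vector of the same dimension.
    """
    if len(features) != len(scores) or not features:
        raise ValueError("need one score per time step and at least one step")
    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    # Weighted sum of the time-step features.
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

# Equal scores reduce attention pooling to a plain average over time steps.
pooled = attention_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

A window-level classifier would then operate on the pooled vector, while the time-step predictions continue to use the unpooled sequence.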

CV · Mar 6, 2025
Manboformer: Learning Gaussian Representations via Spatial-temporal Attention Mechanism

Ziyue Zhao, Qining Qi, Jianfa Ma

In 3D semantic occupancy prediction for autonomous driving, GaussianFormer proposed describing scenes with sparse, object-centric 3D semantic Gaussians, an alternative to voxel-based grid prediction with lower memory requirements. Each 3D Gaussian represents a flexible region of interest and its semantic features, which are iteratively refined by an attention mechanism. In experiments, we find that the number of Gaussians required by this method exceeds the query resolution of the original dense grid network, resulting in impaired performance. We therefore consider optimizing GaussianFormer by exploiting previously unused temporal information. We adopt the spatial-temporal self-attention mechanism from earlier grid-based occupancy networks and adapt it to GaussianFormer. Experiments on the nuScenes dataset are currently underway.

AS · May 23, 2019
A Perceptual Weighting Filter Loss for DNN Training in Speech Enhancement

Ziyue Zhao, Samy Elshamy, Tim Fingscheidt

Single-channel speech enhancement with deep neural networks (DNNs) has shown promising performance and is thus being intensively studied. In this paper, instead of applying the mean squared error (MSE) as the loss function during DNN training for speech enhancement, we design a perceptual weighting filter loss motivated by the weighting filter as it is employed in analysis-by-synthesis speech coding, e.g., in code-excited linear prediction (CELP). The experimental results show that the proposed simple loss function improves the speech enhancement performance compared to a reference DNN with MSE loss in terms of perceptual quality and noise attenuation. The proposed loss function can be advantageously applied to an existing DNN-based speech enhancement system without modifying the DNN topology. The source code for the proposed approach is made available.
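As a rough illustration of how a CELP-style weighting filter can shape a training loss, the sketch below evaluates the magnitude response of W(z) = A(z/γ1) / A(z/γ2) from linear prediction coefficients and uses it to weight a per-bin squared error. This is the textbook form of the weighting filter and may differ from the paper's exact formulation. The function names, the values γ1 = 0.92 and γ2 = 0.6, the bin count, and the toy first-order predictor are all illustrative assumptions.

```python
import cmath
import math

def weighting_response(lpc, gamma1=0.92, gamma2=0.6, n_bins=8):
    """Magnitude of W(z) = A(z/gamma1) / A(z/gamma2) at n_bins frequencies in [0, pi).

    lpc: coefficients of A(z) = sum_k lpc[k] * z**(-k), with lpc[0] == 1.
    """
    def a_eval(gamma, omega):
        # Bandwidth expansion: scaling z by 1/gamma multiplies coefficient k by gamma**k.
        return sum(a * gamma ** k * cmath.exp(-1j * omega * k)
                   for k, a in enumerate(lpc))
    response = []
    for i in range(n_bins):
        omega = math.pi * i / n_bins
        response.append(abs(a_eval(gamma1, omega) / a_eval(gamma2, omega)))
    return response

def weighted_mse(estimate, target, weights):
    """Mean squared error after weighting the per-bin error."""
    return sum((w * (e - t)) ** 2
               for w, e, t in zip(weights, estimate, target)) / len(weights)

# Toy first-order predictor; real systems estimate A(z) per frame from the speech.
weights = weighting_response([1.0, -0.9])
loss = weighted_mse([0.5] * 8, [0.4] * 8, weights)
```

With γ1 equal to γ2 the weights are all one and the loss reduces to the plain MSE, which makes the weighting easy to switch off when comparing against a baseline.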

AS · Jun 25, 2018
Convolutional Neural Networks to Enhance Coded Speech

Ziyue Zhao, Huijun Liu, Tim Fingscheidt

Enhancing coded speech that suffers from far-end acoustic background noise, quantization noise, and potentially transmission errors is a challenging task. In this work, we propose two postprocessing approaches applying convolutional neural networks (CNNs) either in the time domain or the cepstral domain to enhance the coded speech without any modification of the codecs. The time domain approach follows an end-to-end fashion, while the cepstral domain approach uses analysis-synthesis with cepstral domain features. The proposed postprocessors in both domains are evaluated for various narrowband and wideband speech codecs in a wide range of conditions. The proposed postprocessor improves speech quality (PESQ) by up to 0.25 MOS-LQO points for G.711, 0.30 points for G.726, 0.82 points for G.722, and 0.26 points for the adaptive multirate wideband codec (AMR-WB). In a subjective CCR listening test, the proposed postprocessor on G.711-coded speech exceeds the speech quality of an ITU-T-standardized postfilter by 0.36 CMOS points, and obtains a clear preference of 1.77 CMOS points compared to legacy G.711, even better than uncoded speech with statistical significance. The source code for the cepstral domain approach to enhance G.711-coded speech is made available.