Shuhua Zhang

AS
h-index: 5
5 papers
16 citations
Novelty: 43%
AI Score: 32

5 Papers

NA · Mar 17, 2016
A Stationary Accumulated Projection Method for Linear System of Equations

Wujian Peng, Shuhua Zhang

This paper shows that almost all prevalent iterative methods for solving linear systems of equations can be classified as what we call extended Krylov subspace methods. A new type of iterative method is introduced that does not depend on any Krylov subspace. These methods are based on the so-called accumulated projection technique proposed by the authors, which overcomes some shortcomings of the classical row-projection technique and takes full advantage of the linear system. In contrast to traditional Krylov subspace methods, which always rely on matrix-vector multiplication with some fixed matrix, the newly introduced stationary accumulated projection (SAP) method uses projection matrices that differ at each step of the iteration to form an approximate solution. More importantly, two acceleration schemes (named MSAP1 and MSAP2) are introduced to improve the convergence of the SAP method. Numerical experiments show surprisingly improved convergence behavior, and in some situations the MSAP methods outperform GMRES and block-Jacobi.
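For orientation, the classical row-projection technique that the abstract contrasts with can be sketched as the cyclic Kaczmarz iteration below. This is the textbook baseline only, not the authors' SAP or MSAP schemes, whose details the abstract does not give. The function name `kaczmarz`, the sweep count, and the 2x2 test system are illustrative assumptions.

```python
def kaczmarz(A, b, sweeps=200):
    """Classical cyclic row projection (Kaczmarz) for a consistent system A x = b.

    A is a list of rows, b a list of right-hand sides. This is the textbook
    baseline, not the SAP method described in the abstract.
    """
    n = len(A[0])
    x = [0.0] * n
    for _ in range(sweeps):
        for row, rhs in zip(A, b):
            norm_sq = sum(a * a for a in row)
            if norm_sq == 0.0:
                continue  # an all-zero row defines no hyperplane to project onto
            # Residual of the current iterate for this single equation.
            residual = rhs - sum(a * xi for a, xi in zip(row, x))
            # Orthogonal projection of x onto the hyperplane row . x = rhs.
            step = residual / norm_sq
            x = [xi + step * a for xi, a in zip(x, row)]
    return x


# Hypothetical 2x2 example: 4x + y = 1, x + 3y = 2, exact solution (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = kaczmarz(A, b)
```

Each inner step uses only one row of the matrix, which is why row-projection methods need no matrix-vector product with a fixed matrix.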

AS · Oct 29, 2022
Application of Knowledge Distillation to Multi-task Speech Representation Learning

Mine Kerpicci, Van Nguyen, Shuhua Zhang et al.

Model architectures such as wav2vec 2.0 and HuBERT have been proposed to learn speech representations from audio waveforms in a self-supervised manner. When they are combined with downstream tasks such as keyword spotting and speaker verification, they provide state-of-the-art performance. However, these models use a large number of parameters, the smallest version of which has 95 million parameters. This constitutes a challenge for edge AI device deployments. In this paper, we investigate the application of knowledge distillation to speech representation learning (SRL) models followed by joint fine-tuning with multiple downstream voice-activated tasks. In our experiments on two such tasks, our approach results in nearly 75% reduction in model size while suffering only 0.1% accuracy and 0.9% equal error rate degradation compared to the full-size model. In addition, we show that fine-tuning the SRL models results in a significant performance boost compared to using frozen SRL models.
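The abstract does not state which distillation objective is used. As a rough illustration of the general idea, the sketch below implements the common logit-based form of knowledge distillation: a KL divergence between temperature-softened teacher and student distributions. The function names, the temperature value, and the choice of a logit-level loss (rather than, for example, matching hidden representations) are all assumptions made here for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic soft-label distillation; not necessarily the loss used in the paper.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the large model
    q = softmax(student_logits, temperature)  # predictions of the small model
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


# Hypothetical logits for a three-class keyword-spotting output.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge.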

AS · Sep 18, 2025
Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation

Miseul Kim, Soo Jin Park, Kyungguen Byun et al.

Speaker diarization systems often struggle with high intrinsic intra-speaker variability, such as shifts in emotion, health, or content. This can cause segments from the same speaker to be misclassified as different individuals, for example, when one raises their voice or speaks faster during conversation. To address this, we propose a style-controllable speech generation model that augments speech across diverse styles while preserving the target speaker's identity. The proposed system starts with diarized segments from a conventional diarizer. For each diarized segment, it generates augmented speech samples enriched with phonetic and stylistic diversity. Speaker embeddings from both the original and generated audio are then blended to enhance the system's robustness in grouping segments with high intrinsic intra-speaker variability. We validate our approach on a simulated emotional speech dataset and the truncated AMI dataset, demonstrating significant improvements, with error rate reductions of 49% and 35% on the two datasets, respectively.
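The abstract does not say how the embeddings are blended. One plausible reading, sketched below purely as an assumption, is a weighted mean of the length-normalized original embedding and the mean of the embeddings from the generated samples. The function names and the weight `alpha` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0.0 else list(vec)


def blend_embeddings(original, generated, alpha=0.5):
    """Blend one segment embedding with embeddings of its augmented versions.

    alpha weights the original embedding; (1 - alpha) weights the mean of the
    generated ones. This weighting is an illustrative assumption.
    """
    orig = l2_normalize(original)
    if not generated:
        return orig  # nothing to blend with
    gens = [l2_normalize(g) for g in generated]
    dim = len(orig)
    gen_mean = [sum(g[i] for g in gens) / len(gens) for i in range(dim)]
    mixed = [alpha * orig[i] + (1.0 - alpha) * gen_mean[i] for i in range(dim)]
    return l2_normalize(mixed)


# Hypothetical 3-dimensional embeddings for one diarized segment.
segment = [0.9, 0.1, 0.2]
augmented = [[0.8, 0.3, 0.1], [0.7, 0.2, 0.4]]
blended = blend_embeddings(segment, augmented)
```

The blended vector can then replace the original segment embedding in whatever clustering step the diarizer already uses.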

AS · Oct 3, 2021
Multi-task Voice Activated Framework using Self-supervised Learning

Shehzeen Hussain, Van Nguyen, Shuhua Zhang et al.

Self-supervised learning methods such as wav2vec 2.0 have shown promising results in learning speech representations from unlabelled and untranscribed speech data that are useful for speech recognition. Since these representations are learned without any task-specific supervision, they can also be useful for other voice-activated tasks such as speaker verification, keyword spotting, and emotion classification. In our work, we propose a general-purpose framework for adapting a pre-trained wav2vec 2.0 model for different voice-activated tasks. We develop downstream network architectures that operate on the contextualized speech representations of wav2vec 2.0 to adapt the representations for solving a given task. Finally, we extend our framework to perform multi-task learning by jointly optimizing the network parameters on multiple voice-activated tasks using a shared transformer backbone. Both our single-task and multi-task frameworks achieve state-of-the-art results on speaker verification and keyword spotting benchmarks. Our best performing models achieve 1.98% and 3.15% EER on the VoxCeleb1 test set when trained on VoxCeleb2 and VoxCeleb1 respectively, and 98.23% accuracy on the Google Speech Commands v1.0 keyword spotting dataset.

CL · Sep 17, 2017
Character Distributions of Classical Chinese Literary Texts: Zipf's Law, Genres, and Epochs

Chao-Lin Liu, Shuhua Zhang, Yuanli Geng et al.

We collect 14 representative corpora for major periods in Chinese history in this study. These corpora include poetic works produced in several dynasties, novels of the Ming and Qing dynasties, and essays and news reports written in modern Chinese. The corpora span the period from 1046 BCE to 2007 CE. We analyze their character and word distributions from the viewpoint of Zipf's law, and look for factors that affect the deviations and similarities between their Zipfian curves. Genres and epochs both showed an influence in our analyses. Specifically, the character distributions of poetic works written between 618 CE and 1644 CE exhibit striking similarity. In addition, although texts of the same dynasty may tend to use the same set of characters, their character distributions still deviate from each other.
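Zipf's law says that the frequency of an item is roughly inversely proportional to its rank, so a log-log plot of frequency against rank is close to a straight line with slope near -1. A minimal sketch of how a character-level Zipfian curve could be summarized is given below. The function names and the least-squares fit are illustrative choices, not the authors' procedure, and the frequency list is synthetic.

```python
import math
from collections import Counter


def rank_frequency(text):
    """Character counts sorted from most to least frequent (whitespace ignored)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return sorted(counts.values(), reverse=True)


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two ranks; an ideal Zipfian distribution gives a slope of -1.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic frequencies following f(r) = 1200 / r exactly.
ideal = [1200, 600, 400, 300, 240, 200]
slope = zipf_slope(ideal)
```

Comparing such slopes, or the full rank-frequency curves, across corpora is one simple way to quantify how genre or epoch shifts a distribution.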