66.4 NA May 30
A structural bound for cluster robustness of randomized small-block Lanczos
Nian Shao
The Lanczos method is a fast and memory-efficient algorithm for solving large-scale symmetric eigenvalue problems. However, its rapid convergence can deteriorate significantly when computing clustered eigenvalues due to a lack of cluster robustness. A promising strategy to enhance cluster robustness -- without substantially compromising convergence speed or memory efficiency -- is to use a random small-block initial, where the block size is greater than one but still much smaller than the cluster size. This leads to the Randomized Small-Block Lanczos (RSBL) method. Despite its empirical effectiveness, RSBL lacks the comprehensive theoretical understanding already available for single-vector and large-block variants. In this paper, we develop a structural bound that supports the cluster robustness of RSBL by leveraging tools from matrix polynomials. We identify an intrinsic theoretical challenge stemming from the non-commuting nature of matrix multiplication. To provide further insight, we propose a conjectured probabilistic bound for cluster robustness and validate it through empirical experiments. Finally, we discuss how insights into cluster robustness can enhance our understanding of RSBL for both eigenvalue computation and low-rank approximation.
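For orientation, the small-block setting described above can be written down as a block Krylov subspace. This is the standard textbook definition rather than a formula quoted from the paper; the symbols $\Omega$ (random initial block), $b$ (block size), $c$ (cluster size), and $k$ (number of block steps) are notation assumed here for illustration.

```latex
% Block Krylov subspace generated by a symmetric matrix A and a random
% initial block \Omega with b columns (notation assumed for illustration).
\[
  \mathcal{K}_k(A,\Omega)
  = \operatorname{range}\bigl[\,\Omega,\; A\Omega,\; \dots,\; A^{k-1}\Omega\,\bigr],
  \qquad \Omega \in \mathbb{R}^{n \times b}, \quad 1 < b \ll c .
\]
% Every vector in this subspace has the form
\[
  x = \sum_{j=0}^{k-1} A^{j}\,\Omega\, c_j ,
  \qquad c_j \in \mathbb{R}^{b},
\]
% so for b > 1 the coefficients are vectors rather than scalars, which is
% where polynomials with matrix-valued coefficients enter the analysis.
```

With $b = 1$ this reduces to the single-vector Krylov subspace, whose elements are $p(A)\,\omega$ for a scalar polynomial $p$ of degree less than $k$; the large-block case corresponds to $b \ge c$.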
AS Jun 7, 2023 Code
Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks
Xian Li, Nian Shao, Xiaofei Li
Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, generally including clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level downstream tasks. To tackle both clip-level and frame-level tasks, this paper proposes the Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a Transformer encoder and a teacher-student training scheme. We have carefully designed the view creation strategy for ATST-Clip and ATST-Frame. Specifically, ATST-Clip uses segment-wise data augmentations, and ATST-Frame integrates frame-wise data augmentations and masking. Experimental results show that our ATST-Frame model obtains state-of-the-art (SOTA) performance on most of the clip-level and frame-level downstream tasks. In particular, it outperforms other models by a large margin on the frame-level sound event detection task. In addition, the performance can be further improved by combining the two models through knowledge distillation. Our code is available online.
AS Feb 27, 2025
CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR
Nian Shao, Rui Zhou, Pengyu Wang et al.
In this work, we propose CleanMel, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. The proposed network takes as input the noisy and reverberant microphone recording and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can be either transformed to the speech waveform with a neural vocoder or directly used for ASR. The proposed network is composed of interleaved cross-band and narrow-band processing in the Mel-frequency domain, for learning the full-band spectral pattern and the narrow-band properties of signals, respectively. Compared to linear-frequency domain or time-domain speech enhancement, the key advantage of Mel-spectrogram enhancement is that the Mel-frequency domain represents speech in a more compact way and is thus easier to learn, which benefits both speech quality and ASR. Experimental results on five English datasets and one Chinese dataset demonstrate a significant improvement in both speech quality and ASR performance achieved by the proposed model. Code and audio examples of our model are available online.
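The compactness of the Mel-frequency domain mentioned above can be illustrated with a small sketch. It assumes the common HTK-style Mel formula, which the abstract does not specify, and the function names, the 8-band count, and the 0 to 8000 Hz range are hypothetical choices made only for this example.

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to Mel using the HTK-style formula (an assumed convention)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_bands, f_min_hz, f_max_hz):
    """Return num_bands + 1 band edges in Hz, equally spaced on the Mel scale."""
    m_min = hz_to_mel(f_min_hz)
    m_max = hz_to_mel(f_max_hz)
    step = (m_max - m_min) / num_bands
    return [mel_to_hz(m_min + i * step) for i in range(num_bands + 1)]


# Illustrative values only: 8 bands covering 0 to 8000 Hz.
edges = mel_band_edges(8, 0.0, 8000.0)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

for lo, hi in zip(edges, edges[1:]):
    print(f"{lo:8.1f} Hz to {hi:8.1f} Hz")
```

Bands that are equally wide in Mel become progressively wider in Hz, so a fixed number of Mel bands spends most of its resolution on low frequencies and summarizes high frequencies coarsely.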
48.9 NA Apr 1
Stabilizing the Rayleigh--Ritz procedure by randomization
Nian Shao
Extracting approximate eigenpairs from a prescribed subspace is of fundamental importance in eigenvalue computation. While projecting the target eigenvector onto the subspace yields satisfactory accuracy, extracting an approximate eigenpair that attains a comparable convergence rate has remained a long-standing open problem. Although the standard Rayleigh--Ritz procedure is widely used for this purpose, it may suffer from deteriorated convergence of Ritz values and may even fail to produce convergent Ritz vectors. In this paper, we address this long-standing open problem by introducing a randomized Rayleigh--Ritz procedure whose output converges at a rate similar to the ideal projection. Our analysis requires only the simplicity of the target eigenvalue and extends naturally to nonlinear eigenvalue problems.
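As a reminder, the standard Rayleigh--Ritz procedure that the abstract contrasts against can be summarized as follows. This is the textbook formulation, not the randomized variant proposed in the paper, and the symbols $Q$, $m$, $\theta$, and $y$ are notation assumed here.

```latex
% Standard Rayleigh--Ritz extraction from a subspace with orthonormal
% basis Q (notation assumed for illustration).
\[
  Q \in \mathbb{C}^{n \times m}, \quad Q^{*} Q = I_m ,
  \qquad
  \bigl(Q^{*} A Q\bigr)\, y = \theta\, y ,
  \qquad
  (\theta,\; Qy) \ \text{is a Ritz pair of } A .
\]
% The "ideal projection" of a target eigenvector x onto the same subspace is
\[
  Q Q^{*} x ,
\]
% which cannot be formed in practice because x is unknown.
```

The gap described in the abstract is between the accuracy of $QQ^{*}x$ and the accuracy of the Ritz pair $(\theta, Qy)$ actually returned by the procedure.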
AS Oct 21, 2021
RCT: Random Consistency Training for Semi-supervised Sound Event Detection
Nian Shao, Erfan Loweimi, Xiaofei Li
Sound event detection (SED), as a core module of acoustic environmental analysis, suffers from the problem of data deficiency. The integration of semi-supervised learning (SSL) largely mitigates this problem without requiring any extra annotation budget. This paper investigates several core modules of SSL and introduces a random consistency training (RCT) strategy. First, a self-consistency loss is proposed to fuse with the teacher-student model to stabilize the training. Second, a hard mixup data augmentation is proposed to account for the additive property of sounds. Third, a random augmentation scheme is applied to flexibly combine different types of data augmentations. Experiments show that the proposed strategy outperforms other widely used strategies.
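One plausible reading of a mixup that respects the additive property of sounds is sketched below: the two waveforms are summed, and the multi-hot labels are combined by union, since both sets of events are audible in the mixture. The abstract does not give the exact formulation, so the function name `hard_mixup`, the clip-level label layout, and the absence of any gain scaling are assumptions made for illustration.

```python
def hard_mixup(wave_a, wave_b, labels_a, labels_b):
    """Mix two clips by addition and merge their multi-hot labels by union.

    Assumed formulation for illustration only; the abstract does not
    specify the exact mixing rule.
    """
    if len(wave_a) != len(wave_b):
        raise ValueError("waveforms must have the same number of samples")
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must cover the same event classes")

    # Sounds add in the time domain, so the mixture is the sample-wise sum.
    mixed_wave = [a + b for a, b in zip(wave_a, wave_b)]
    # An event is present in the mixture if it is present in either clip.
    mixed_labels = [max(la, lb) for la, lb in zip(labels_a, labels_b)]
    return mixed_wave, mixed_labels


# Toy example: two 2-sample clips over three hypothetical event classes.
mixed_wave, mixed_labels = hard_mixup(
    [0.1, -0.2], [0.3, 0.2],
    [1, 0, 0], [0, 0, 1],
)
```

Unlike interpolation-based mixup, which blends labels with fractional weights, this variant keeps the targets binary.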