Nian Shao

h-index 5

4 papers

61 citations

Novelty 50%

AI Score 47

Ranked #30,843 of 194,257 authors (top 16%); #121 in AS (top 8%)

4 Papers

15.5 · AS · Jun 7, 2023 · Code

Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks

Xian Li, Nian Shao, Xiaofei Li

Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, generally including clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level downstream tasks. In order to tackle both clip-level and frame-level tasks, this paper proposes Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a Transformer encoder and a teacher-student training scheme. We have carefully designed the view creation strategy for ATST-Clip and ATST-Frame. Specifically, ATST-Clip uses segment-wise data augmentations, and ATST-Frame integrates frame-wise data augmentations and masking. Experimental results show that our ATST-Frame model obtains state-of-the-art (SOTA) performance on most of the clip-level and frame-level downstream tasks. In particular, it outperforms other models by a large margin on the frame-level sound event detection task. In addition, the performance can be further improved by combining the two models through knowledge distillation. Our code is available online.
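The abstract does not say how the teacher network is obtained. A common choice in teacher-student self-supervised learning is to update the teacher as an exponential moving average (EMA) of the student. The sketch below assumes that scheme; the names `ema_update` and `momentum` are hypothetical and are not taken from the ATST code.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float],
               momentum: float = 0.99) -> List[float]:
    """Return teacher weights moved one EMA step toward the student weights."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    # Assumed update rule: teacher <- momentum * teacher + (1 - momentum) * student
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy weights standing in for flattened network parameters.
teacher = [0.0, 0.0]
student = [1.0, -2.0]
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
```

With a momentum close to 1, the teacher changes slowly and acts as a smoothed copy of the student, which is the usual motivation for this kind of update.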

7.9 · NA · May 30

A structural bound for cluster robustness of randomized small-block Lanczos

Nian Shao

The Lanczos method is a fast and memory-efficient algorithm for solving large-scale symmetric eigenvalue problems. However, its rapid convergence can deteriorate significantly when computing clustered eigenvalues due to a lack of cluster robustness. A promising strategy to enhance cluster robustness -- without substantially compromising convergence speed or memory efficiency -- is to use a random small-block initial, where the block size is greater than one but still much smaller than the cluster size. This leads to the Randomized Small-Block Lanczos (RSBL) method. Despite its empirical effectiveness, RSBL lacks the comprehensive theoretical understanding already available for single-vector and large-block variants. In this paper, we develop a structural bound that supports the cluster robustness of RSBL by leveraging tools from matrix polynomials. We identify an intrinsic theoretical challenge stemming from the non-commuting nature of matrix multiplication. To provide further insight, we propose a conjectured probabilistic bound for cluster robustness and validate it through empirical experiments. Finally, we discuss how insights into cluster robustness can enhance our understanding of RSBL for both eigenvalue computation and low-rank approximation.
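For orientation, the subspace that a block Lanczos iteration works in can be written down explicitly. The notation below ($A$, $\Omega$, $n$, $b$, $k$, $m$, $C_i$) is assumed for illustration and is not taken from the paper.

```latex
% A: n-by-n symmetric matrix; \Omega: n-by-b random starting block;
% k: number of block iterations; m: size of the eigenvalue cluster.
\mathcal{K}_k(A, \Omega)
  = \operatorname{span}\bigl\{\Omega,\ A\Omega,\ A^{2}\Omega,\ \dots,\ A^{k-1}\Omega\bigr\},
\qquad \Omega \in \mathbb{R}^{n \times b}, \quad 1 < b \ll m .

% Blocks in this subspace have the form
\sum_{i=0}^{k-1} A^{i}\,\Omega\, C_i , \qquad C_i \in \mathbb{R}^{b \times b} .
```

Setting $b = 1$ recovers the single-vector Krylov subspace, where each $C_i$ is a scalar. For $b > 1$ the coefficients $C_i$ are matrices that do not commute in general, which is one way to see why matrix polynomials and non-commutativity enter the analysis.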

2.3 · AS · Feb 27, 2025 · Code

CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR

Nian Shao, Rui Zhou, Pengyu Wang et al.

In this work, we propose CleanMel, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. The proposed network takes as input the noisy and reverberant microphone recording and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can be either transformed to the speech waveform with a neural vocoder or directly used for ASR. The proposed network is composed of interleaved cross-band and narrow-band processing in the Mel-frequency domain, for learning the full-band spectral pattern and the narrow-band properties of signals, respectively. Compared to linear-frequency domain or time-domain speech enhancement, the key advantage of Mel-spectrogram enhancement is that Mel-frequency presents speech in a more compact way and thus is easier to learn, which will benefit both speech quality and ASR. Experimental results on five English datasets and one Chinese dataset demonstrate a significant improvement in both speech quality and ASR performance achieved by the proposed model. Code and audio examples of our model are available online.
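To make the compactness claim concrete, the sketch below converts between hertz and mels using the widely used HTK-style formula m = 2595 log10(1 + f/700) and lists band edges that are equally spaced on the mel scale. Whether CleanMel uses this exact mel variant is not stated in the abstract; the formula choice, the 0 to 8000 Hz range, the 80 bands, and the helper names are all assumptions for illustration.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style hertz-to-mel conversion (assumed variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> List[float]:
    """Return n_bands + 1 edge frequencies (Hz), equally spaced in mels."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]


# Hypothetical setting: 80 bands covering 0 Hz to 8 kHz.
edges = mel_band_edges(0.0, 8000.0, 80)
# Width of each band in Hz; widths grow with frequency.
widths = [upper - lower for lower, upper in zip(edges, edges[1:])]
```

The bands are narrow at low frequencies and wide at high frequencies, so a modest number of bands covers the whole range while keeping fine resolution where speech carries most of its energy.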

1.2 · AS · Oct 21, 2021 · Code

RCT: Random Consistency Training for Semi-supervised Sound Event Detection

Nian Shao, Erfan Loweimi, Xiaofei Li

Sound event detection (SED), as a core module of acoustic environmental analysis, suffers from the problem of data deficiency. The integration of semi-supervised learning (SSL) largely mitigates this problem without requiring any extra annotation budget. This paper investigates several core modules of SSL and introduces a random consistency training (RCT) strategy. First, a self-consistency loss is proposed to fuse with the teacher-student model to stabilize the training. Second, a hard mixup data augmentation is proposed to account for the additive property of sounds. Third, a random augmentation scheme is applied to flexibly combine different types of data augmentations. Experiments show that the proposed strategy outperforms other widely used strategies.
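The abstract gives no formula for hard mixup. One plausible reading of "the additive property of sounds" is that two recordings are summed sample by sample and the mixture is labelled with every event active in either source. The sketch below assumes that reading; the function name `hard_mixup` and its signature are hypothetical, and the paper's exact formulation may differ.

```python
from typing import List, Tuple


def hard_mixup(wave_a: List[float], wave_b: List[float],
               labels_a: List[int], labels_b: List[int]
               ) -> Tuple[List[float], List[int]]:
    """Mix two clips by addition and merge their multi-hot labels by union."""
    if len(wave_a) != len(wave_b):
        raise ValueError("waveforms must have the same length")
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    # Assumption: sounds superimpose, so the mixture is the sample-wise sum.
    mixed_wave = [a + b for a, b in zip(wave_a, wave_b)]
    # Assumption: an event is present in the mixture if it is in either clip.
    mixed_labels = [max(a, b) for a, b in zip(labels_a, labels_b)]
    return mixed_wave, mixed_labels


# Toy example: three samples and three event classes.
wave, labels = hard_mixup([0.5, -0.25, 0.0], [0.25, 0.25, -0.5],
                          [1, 0, 0], [0, 0, 1])
```

This contrasts with standard mixup, which interpolates both inputs and labels with a weight between 0 and 1 and therefore produces soft labels.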