A Theory of Unsupervised Speech Recognition
This work addresses training instability and hyperparameter sensitivity in ASR-U, but it is incremental: it builds on existing algorithms without introducing a new method.
The paper tackles the lack of a theoretical framework for unsupervised speech recognition (ASR-U) by proposing one based on random matrix theory and neural tangent kernels. It proves learnability conditions and sample complexity bounds, and experiments on synthetic languages provide empirical support.
Unsupervised speech recognition (ASR-U) is the problem of learning automatic speech recognition (ASR) systems from unpaired speech-only and text-only corpora. While various algorithms exist to solve this problem, a theoretical framework for studying their properties and addressing issues such as sensitivity to hyperparameters and training instability is missing. In this paper, we propose a general theoretical framework for studying the properties of ASR-U systems, based on random matrix theory and the theory of neural tangent kernels. This framework allows us to prove various learnability conditions and sample complexity bounds for ASR-U. Extensive ASR-U experiments on synthetic languages with three classes of transition graphs provide strong empirical evidence for our theory (code available at cactuswiththoughts/UnsupASRTheory.git).
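To make the unpaired setting concrete, the following is a minimal illustrative sketch, not the authors' code, of how a synthetic language could be sampled from a transition graph to produce a text-only corpus and a separate speech-only corpus. Everything here is an assumption made for illustration: the function names, the fully connected random transition matrix (the abstract does not say which three graph classes are used), the uniform initial phone, and the hidden symbol permutation standing in for the speech-to-phone mapping.

```python
import numpy as np


def random_transition_matrix(num_phones, rng):
    """Row-stochastic matrix for a hypothetical fully connected transition graph."""
    raw = rng.random((num_phones, num_phones))
    return raw / raw.sum(axis=1, keepdims=True)


def sample_sequence(transition, length, rng):
    """Sample one phone sequence from a first-order Markov chain."""
    num_phones = transition.shape[0]
    state = int(rng.integers(num_phones))  # uniform initial phone (assumption)
    seq = [state]
    for _ in range(length - 1):
        state = int(rng.choice(num_phones, p=transition[state]))
        seq.append(state)
    return seq


def make_unpaired_corpora(num_phones=5, num_utts=100, length=20, seed=0):
    """Build text-only and speech-only corpora that share no paired examples."""
    rng = np.random.default_rng(seed)
    transition = random_transition_matrix(num_phones, rng)

    # Text-only corpus: phone sequences drawn from the synthetic language.
    text = [sample_sequence(transition, length, rng) for _ in range(num_utts)]

    # Speech-only corpus: independently drawn sequences, observed only through
    # a hidden permutation of the symbols (a stand-in for acoustic units).
    perm = rng.permutation(num_phones)
    speech = [
        [int(perm[p]) for p in sample_sequence(transition, length, rng)]
        for _ in range(num_utts)
    ]
    return transition, perm, text, speech


if __name__ == "__main__":
    transition, perm, text, speech = make_unpaired_corpora()
    print("first text sequence:  ", text[0])
    print("first speech sequence:", speech[0])
```

In this toy version, an ASR-U learner would see only `text` and `speech` and would have to recover `perm` from the statistics of the two corpora, since no utterance in one corpus is aligned with an utterance in the other.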