Improved Conformer-based End-to-End Speech Recognition Using Neural Architecture Search
This work improves speech recognition accuracy for Mandarin language tasks, but it is incremental as it builds on existing NAS and Conformer methods.
The authors tackle the problem of designing neural architectures for speech recognition by applying neural architecture search (NAS) to a Conformer backbone, achieving a relative improvement of about 11% in character error rate (CER) over the baseline Conformer on the AISHELL-1 benchmark.
Recently, neural architecture search (NAS) has been successfully applied to image classification, natural language processing, and automatic speech recognition (ASR) tasks to find state-of-the-art (SOTA) architectures that outperform human-designed ones. Given a pre-defined search space and a search algorithm, NAS can derive a SOTA, data-specific architecture by evaluating candidates on validation data. Inspired by the success of NAS in ASR tasks, we propose a NAS-based ASR framework consisting of one search space and one differentiable search algorithm, Differentiable Architecture Search (DARTS). Our search space follows the convolution-augmented transformer (Conformer) backbone, which is a more expressive ASR architecture than those used in existing NAS-based ASR frameworks. To improve the performance of our method, a regulation method called Dynamic Search Schedule (DSS) is employed. On the widely used Mandarin benchmark AISHELL-1, our best-searched architecture significantly outperforms the baseline Conformer model, with a relative improvement of about 11% in character error rate (CER), and search cost comparisons show that our method is efficient.
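The core idea behind a differentiable search such as DARTS can be illustrated with a small sketch. Each searchable position computes a softmax-weighted mixture of candidate operations, the mixture weights (architecture parameters) are learned by gradient descent, and the strongest candidate is kept once the search ends. The candidate operations, the names `CANDIDATE_OPS`, `mixed_op`, and `derive_op`, and the toy input below are assumptions chosen for illustration; they are not the search space or code used in this work, and DSS is not modeled.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical candidate operations for one searchable position.
# A real Conformer search space would hold neural layers instead.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "zero": lambda x: np.zeros_like(x),
    "double": lambda x: 2.0 * x,
}

def mixed_op(x, alpha):
    """Continuous relaxation: softmax(alpha)-weighted sum of all candidates."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))

def derive_op(alpha):
    """Discretization after search: keep the candidate with the largest alpha."""
    names = list(CANDIDATE_OPS)
    return names[int(np.argmax(alpha))]

x = np.array([1.0, -2.0, 3.0])

# With equal architecture parameters, every candidate gets weight 1/3,
# so the mixture is (x + 0 + 2x) / 3, which equals x.
uniform_alpha = np.zeros(3)
y = mixed_op(x, uniform_alpha)

# After a (hypothetical) search, the largest alpha selects the final operation.
searched_alpha = np.array([0.1, -1.0, 2.0])
chosen = derive_op(searched_alpha)  # "double"
```

In a full DARTS setup, the architecture parameters are typically updated on validation data while the network weights are updated on training data; the sketch omits both optimization loops and shows only the relaxation and the final selection step.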