AutoMode-ASR: Learning to Select ASR Systems for Better Quality and Cost
This work addresses the challenge of balancing quality and cost in ASR for users who rely on multiple systems, though the contribution is incremental, as it builds on existing ensemble and selection methods.
The paper tackles the problem of selecting the optimal automatic speech recognition (ASR) system for each audio segment to improve transcription quality and reduce cost, reporting a 16.2% relative reduction in word error rate (WER), a 65% cost saving, and a 75% speed improvement compared to using a single-best model.
We present AutoMode-ASR, a novel framework that effectively integrates multiple ASR systems to enhance overall transcription quality while optimizing cost. The idea is to train a decision model that selects the optimal ASR system for each segment based solely on the audio input, before any of the systems is run. We achieve this by ensembling binary classifiers, each of which predicts the preference between two systems. These classifiers use various features, such as audio embeddings, quality estimation, and signal properties. Additionally, we demonstrate how using a quality estimator can further improve performance with a minimal increase in cost. Experimental results show a relative reduction in WER of 16.2%, a cost saving of 65%, and a speed improvement of 75%, compared to using a single-best model for all segments. Our framework is compatible with commercial and open-source black-box ASR systems, as it does not require changes to model code.
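
To make the selection step more concrete, the following is a minimal sketch of how pairwise preference classifiers might be combined into a per-segment decision. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the binary classifiers are ensembled, so the soft-voting rule here is an assumption, and the system names (`sys_a`, `sys_b`, `sys_c`), the single SNR feature, and the hand-written toy classifiers are all hypothetical stand-ins for trained models.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]
# A pairwise model returns the estimated probability that the FIRST system
# of its pair yields the better transcript for the given segment features.
PairwiseModel = Callable[[Features], float]


def select_system(
    features: Features,
    systems: Sequence[str],
    pairwise: Dict[Tuple[str, str], PairwiseModel],
) -> str:
    """Pick one system by soft voting over all pairwise classifiers.

    Assumption: soft voting (summing preference probabilities) is one
    plausible ensembling rule; the source does not state the exact rule.
    """
    scores = {name: 0.0 for name in systems}
    for a, b in combinations(systems, 2):
        p_a_wins = pairwise[(a, b)](features)
        scores[a] += p_a_wins
        scores[b] += 1.0 - p_a_wins
    # Ties are broken in favor of the system listed first.
    return max(systems, key=lambda name: scores[name])


# Hypothetical systems and toy classifiers keyed on one feature:
# features[0] is an assumed signal-to-noise ratio in dB.
SYSTEMS = ["sys_a", "sys_b", "sys_c"]
PAIRWISE: Dict[Tuple[str, str], PairwiseModel] = {
    ("sys_a", "sys_b"): lambda f: 0.8 if f[0] > 20 else 0.3,
    ("sys_a", "sys_c"): lambda f: 0.7 if f[0] > 20 else 0.4,
    ("sys_b", "sys_c"): lambda f: 0.6,
}

if __name__ == "__main__":
    print(select_system([25.0], SYSTEMS, PAIRWISE))  # clean segment
    print(select_system([10.0], SYSTEMS, PAIRWISE))  # noisy segment
```

Because the decision depends only on features computed from the audio, only the selected system needs to be run on each segment, which is where the cost and speed savings described above would come from.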