Pisets: A Robust Speech Recognition System for Lectures and Interviews

Ivan Bondarenko, Daniil Grebenkin, Oleg Sedukhin, Mikhail Klementev, Roman Derunets, Lyudmila Budneva

arXiv:2601.18415v111 citations | h-index: 1 | Has Code | NAACL

Originality: Incremental advance

AI Analysis

This is an incremental improvement for scientists and journalists needing accurate transcriptions of long Russian-language audio.

The authors address speech recognition for lectures and interviews with a three-component system that combines Wav2Vec2, the Audio Spectrogram Transformer, and Whisper, together with curriculum learning and uncertainty modeling. They report more robust transcription across various acoustic conditions than WhisperX and standard Whisper.

This work presents "Pisets", a speech-to-text system for scientists and journalists. It is based on a three-component architecture that aims to improve speech recognition accuracy while minimizing the errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. Curriculum learning and the use of diverse Russian-language speech corpora significantly enhanced the system's effectiveness. Advanced uncertainty modeling techniques were also introduced and further improved transcription quality. The proposed approaches provide more robust transcription of long audio across various acoustic conditions than WhisperX and the standard Whisper model. The source code of the "Pisets" system is publicly available on GitHub: https://github.com/bond005/pisets.
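The three-stage flow described in the abstract can be sketched roughly as follows. This is only an illustration of the control flow, not the Pisets implementation: the `Segment` type, the function `transcribe_three_stage`, and the three callables (`propose_segments`, `is_speech`, `recognize`) are hypothetical names, and the idea that the stages exchange time-stamped segments is an assumption. In a real system the callables would wrap Wav2Vec2, AST, and Whisper respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Segment:
    """A time span in the audio (seconds) with optional text. Hypothetical type."""
    start: float
    end: float
    text: str = ""


def transcribe_three_stage(
    audio: Sequence[float],
    propose_segments: Callable[[Sequence[float]], List[Segment]],
    is_speech: Callable[[Sequence[float], Segment], bool],
    recognize: Callable[[Sequence[float], Segment], str],
) -> List[Segment]:
    """Run a three-stage pipeline over one recording.

    Stage 1 (stand-in for Wav2Vec2): propose candidate speech segments.
    Stage 2 (stand-in for AST): drop candidates judged to be false positives.
    Stage 3 (stand-in for Whisper): produce the final text for each kept segment.
    """
    candidates = propose_segments(audio)
    kept = [seg for seg in candidates if is_speech(audio, seg)]
    return [Segment(seg.start, seg.end, recognize(audio, seg)) for seg in kept]
```

Passing the stages in as callables keeps the sketch independent of any model library, so the filtering step that guards the final recognizer against non-speech input is visible on its own.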
