SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data
This work provides a large-scale dataset and improved models for low-resource Slovak speech recognition, though the contribution is incremental: it applies existing methods to new data.
The authors tackle the problem of limited training data for Slovak automatic speech recognition by creating SloPalSpeech, a 2,806-hour speech corpus built from parliamentary proceedings, and use it to fine-tune Whisper models, reducing Word Error Rate by up to 70% (for Whisper-small) on the Common Voice and FLEURS benchmarks.
Automatic Speech Recognition (ASR) for low-resource languages like Slovak is hindered by the scarcity of training data. To address this, we introduce SloPalSpeech, a new, large-scale Slovak ASR dataset containing 2,806 hours of speech from parliamentary proceedings. We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medium, large-v3, and large-v3-turbo), achieving significant Word Error Rate (WER) reductions on standard Slovak benchmarks like Common Voice and FLEURS. For instance, the fine-tuned Whisper-small model's WER dropped by up to 70%, approaching the baseline performance of the much larger Whisper-large-v3 model. To foster future research in low-resource speech recognition, we publicly release the complete SloPalSpeech dataset, the fully segmented transcripts (60 million words), and all our fine-tuned models.
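For readers unfamiliar with the metric, the following is a minimal sketch of how WER and a relative WER reduction are commonly computed. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_reduction` are hypothetical, the example sentences are invented, and the WER values 0.40 and 0.12 are illustrative numbers that do not come from the paper. Text normalization (casing, punctuation), which typically precedes scoring in practice, is omitted.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction by which fine-tuning lowered the baseline WER."""
    return (baseline_wer - finetuned_wer) / baseline_wer


# Invented example: one deleted word and one substitution over six reference words.
wer = word_error_rate("the parliament session is now open",
                      "the parliament session now opened")
print(f"WER: {wer:.3f}")

# Illustrative values only (not results from the paper).
print(f"Relative reduction: {relative_reduction(0.40, 0.12):.0%}")
```

Under this definition, a "70% reduction" is relative to the baseline WER, so a drop from 0.40 to 0.12 counts as 70% even though the absolute difference is 28 percentage points.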