SwissGPC v1.0 -- The Swiss German Podcasts Corpus
SwissGPC v1.0 gives researchers in ASR, TTS, and dialect identification a valuable resource and addresses a gap in data for real-world Swiss German speech applications, though the contribution is incremental because it builds on existing corpus efforts.
The researchers tackled the lack of large-scale spontaneous Swiss German speech data by creating SwissGPC v1.0, a corpus of nearly 5000 hours of weakly annotated audio from podcasts and talk shows, covering the seven major Swiss German dialect regions and Standard German.
We present SwissGPC v1.0, the first mid-to-large-scale corpus of spontaneous Swiss German speech, developed to support research in ASR, TTS, dialect identification, and related fields. The dataset consists of links to talk shows and podcasts hosted on Schweizer Radio und Fernsehen and YouTube, which contain approximately 5400 hours of raw audio. After segmentation and weak annotation, nearly 5000 hours of speech were retained, covering the seven major Swiss German dialect regions alongside Standard German. We describe the corpus construction methodology, including an automated annotation pipeline, and provide statistics on dialect distribution, token counts, and segmentation characteristics. Unlike existing Swiss German speech corpora, which primarily feature controlled speech, this corpus captures natural, spontaneous conversations, making it a valuable resource for real-world speech applications.