CL Nov 3, 2025

ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

Nikola Ljubešić, Peter Rupnik, Ivan Porupski, Taja Kuzman Pungeršek

arXiv:2511.01619v1, 1 citation, h-index: 7

Originality: Synthesis-oriented

AI Analysis

This release is a valuable resource for researchers in linguistics, speech processing, and political science who study Slavic parliamentary speech, though it is incremental in that it builds on the existing ParlaMint transcripts.

The authors address the scarcity of annotated spoken parliamentary corpora for Slavic languages by releasing ParlaSpeech 3.0, a collection of 6,000 hours of automatically enriched data for Croatian, Czech, Polish, and Serbian. All four corpora carry linguistic annotations, sentiment predictions, and filled-pause (disfluency) markers; two of the four additionally carry word- and grapheme-level alignments and primary-stress annotations. The authors demonstrate the value of these layers through an analysis of acoustic correlates of sentiment.

ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages (Croatian, Czech, Polish, and Serbian), totaling 6 thousand hours. The corpora were built automatically from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each parliament. In this release of the dataset, each of the corpora is significantly enriched with various automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions. Similarly, their spoken modality has been automatically enriched with occurrences of filled pauses, the most frequent disfluency in typical speech. Two of the four languages have additionally been enriched with detailed word- and grapheme-level alignments and automatic annotation of the position of primary stress in multisyllabic words. With these enrichments, the usefulness of the underlying corpora has been drastically increased for downstream research across multiple disciplines, which we showcase through an analysis of acoustic correlates of sentiment. All the corpora are made available for download in JSONL and TextGrid formats, as well as for search through a concordancer.
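As a rough illustration of how a JSONL release like this might feed an analysis of acoustic correlates of sentiment, the sketch below reads JSONL records and averages one acoustic measure per sentiment label. The field names `sentiment` and `mean_pitch_hz`, and the three toy records, are assumptions made up for this example; they are not the actual ParlaSpeech schema or data, and the authors' own analysis may differ.

```python
import io
import json
from collections import defaultdict

# Toy records for illustration only. The keys "sentiment" and "mean_pitch_hz"
# are hypothetical and do not reflect the real ParlaSpeech JSONL schema.
SAMPLE_JSONL = """\
{"id": "a", "sentiment": "negative", "mean_pitch_hz": 180.0}
{"id": "b", "sentiment": "positive", "mean_pitch_hz": 150.0}
{"id": "c", "sentiment": "negative", "mean_pitch_hz": 200.0}
"""


def read_jsonl(stream):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def mean_by_label(records, label_key, value_key):
    """Return the mean of value_key for each distinct value of label_key.

    Records missing either key are skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        if label_key not in rec or value_key not in rec:
            continue
        sums[rec[label_key]] += rec[value_key]
        counts[rec[label_key]] += 1
    return {label: sums[label] / counts[label] for label in sums}


records = list(read_jsonl(io.StringIO(SAMPLE_JSONL)))
means = mean_by_label(records, "sentiment", "mean_pitch_hz")
print(means)
```

With a real download, `io.StringIO(SAMPLE_JSONL)` would be replaced by an opened file, and the two key names by whatever the released records actually use.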
