CL · Oct 23, 2024

SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments

arXiv:2410.17886v1 · 105 citations · h-index: 3 · CPSS
Originality: Synthesis-oriented
AI Analysis

SpeakGer gives political science researchers a resource for fine-grained analyses, though the contribution is incremental: it focuses on data collection rather than new methods.

The authors tackled the lack of meta-information in political speech corpora by creating SpeakGer, a dataset of 10,806,105 German parliament speeches from 1947-2023 with rich metadata, enabling deeper quantitative analysis.

The application of natural language processing to political texts and speeches has become increasingly relevant in political science because it allows the analysis of text corpora too large for a single person to read. Such corpora, however, often lack critical meta-information, for instance the party, age or constituency of the speaker, that is needed to tailor an analysis to more fine-grained research questions. To enable researchers to answer such questions with quantitative approaches such as natural language processing, we provide the SpeakGer data set, consisting of German parliament debates from all 16 federal states of Germany as well as the German Bundestag from 1947 to 2023, split into a total of 10,806,105 speeches. The data set includes rich metadata: reactions from the audience to each speech, as well as the speaker's party, age, constituency and their party's political alignment, which enables a deeper analysis. We further provide three exploratory analyses: topic shares of different parties over time, a descriptive analysis of how the age of the average speaker has developed, and a sentiment analysis of speeches by different parties with regard to the COVID-19 pandemic.
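As a rough illustration of the kind of metadata-driven analysis the abstract describes (for example, tracking the average speaker's age over time), the following minimal sketch groups speech records by metadata fields and averages the speaker age. The records, the field names (`year`, `party`, `speaker_age`) and the helper `mean_age_by` are hypothetical and chosen only for illustration; they are not the actual SpeakGer schema or the authors' code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; field names are assumptions, not the SpeakGer schema.
speeches = [
    {"year": 1990, "party": "A", "speaker_age": 52},
    {"year": 1990, "party": "B", "speaker_age": 48},
    {"year": 1990, "party": "A", "speaker_age": 60},
    {"year": 2020, "party": "A", "speaker_age": 45},
    {"year": 2020, "party": "B", "speaker_age": 39},
]


def mean_age_by(records, *keys):
    """Group records by the given metadata keys and average the speaker age."""
    groups = defaultdict(list)
    for rec in records:
        # The group label is a tuple of the selected metadata values.
        groups[tuple(rec[k] for k in keys)].append(rec["speaker_age"])
    return {group: mean(ages) for group, ages in groups.items()}


# Average speaker age per year, and per (year, party) combination.
by_year = mean_age_by(speeches, "year")
by_year_party = mean_age_by(speeches, "year", "party")
```

The same grouping pattern would extend to any other metadata column, such as constituency or political alignment, once the real column names are known.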
