CL · Jun 3, 2025

taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades

Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen

arXiv:2506.05388v1 · 6.71 citations · h-index: 8 · ACL

Originality: Synthesis-oriented

AI Analysis

This work addresses the shortage of data for German NLP and computational social science, enabling research on societal issues such as gender bias. Its contribution is incremental, however, because it applies existing methods to new data.

To address the lack of large-scale German-language resources, the authors created taz2024full, a corpus of over 1.8 million newspaper articles from taz spanning 1980 to 2024. They used it to analyze gender bias, finding a consistent overrepresentation of men alongside a gradual shift toward more balanced coverage in recent years.

Open-access corpora are essential for advancing natural language processing (NLP) and computational social science (CSS). However, large-scale resources for German remain limited, restricting research on linguistic trends and societal issues such as gender bias. We present taz2024full, the largest publicly available corpus of German newspaper articles to date, comprising over 1.8 million texts from taz, spanning 1980 to 2024. As a demonstration of the corpus's utility for bias and discrimination research, we analyse gender representation across four decades of reporting. We find a consistent overrepresentation of men, but also a gradual shift toward more balanced coverage in recent years. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts. The corpus supports a wide range of applications, from diachronic language analysis to critical media studies, and is freely available to foster inclusive and reproducible research in German-language NLP.
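To make the idea of measuring gender representation per decade concrete, here is a minimal sketch. It is not the authors' pipeline: the word lists, the function names (`gender_counts_by_decade`, `male_share`), and the `(year, text)` input format are all assumptions made for illustration. It only counts a handful of gendered German nouns, whereas a serious analysis of actor mentions would need named-entity recognition and much richer lexicons.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny word lists (assumption, not from the paper).
# Pronouns such as "sie" are left out because they are ambiguous in German.
MALE_TERMS = {"mann", "männer", "herr", "vater", "sohn"}
FEMALE_TERMS = {"frau", "frauen", "dame", "mutter", "tochter"}

TOKEN_RE = re.compile(r"\w+")


def decade_of(year: int) -> int:
    """Map a year to the first year of its decade, e.g. 1985 -> 1980."""
    return year - year % 10


def gender_counts_by_decade(articles):
    """Count gendered terms per decade.

    `articles` is an iterable of (year, text) pairs; this input shape is an
    assumption and need not match the corpus's actual schema.
    """
    counts = defaultdict(Counter)
    for year, text in articles:
        decade = decade_of(year)
        for token in TOKEN_RE.findall(text.lower()):
            if token in MALE_TERMS:
                counts[decade]["male"] += 1
            elif token in FEMALE_TERMS:
                counts[decade]["female"] += 1
    return counts


def male_share(counter):
    """Fraction of gendered mentions that are male, or None if there are none."""
    total = counter["male"] + counter["female"]
    return counter["male"] / total if total else None


if __name__ == "__main__":
    # Toy sentences invented for this example, not taken from the corpus.
    toy_articles = [
        (1985, "Der Mann und sein Sohn trafen den Vater. Die Frau wartete."),
        (2021, "Die Frau und der Mann sprachen mit der Mutter."),
    ]
    results = gender_counts_by_decade(toy_articles)
    for decade, counter in sorted(results.items()):
        print(decade, dict(counter), male_share(counter))
```

Tracking `male_share` per decade is one simple way to look for the kind of trend the abstract describes: a value persistently above 0.5 would indicate overrepresentation of men, and a decline over time would indicate a shift toward balance.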
