CL Dec 10, 2019

GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies

Marta R. Costa-jussà, Pau Li Lin, Cristina España-Bonet

arXiv:1912.04778v130.21006 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for non-synthetic, gender-balanced datasets in machine translation and is mainly relevant to NLP researchers and practitioners. Its contribution toward standardizing extraction procedures is incremental.

To work around the gender inequality present in Wikipedia, the authors developed GeBioToolkit, which automatically extracts a gender-balanced multilingual corpus from Wikipedia biographies. The resulting dataset of 2,000 sentences in English, Spanish, and Catalan was post-edited by native speakers for use in machine translation evaluation.

We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract corpus balanced in gender. While our toolkit is customizable to any number of languages (and different domains), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation. While GeBioCorpus aims at being one of the first non-synthetic gender-balanced test datasets, GeBioToolkit aims at paving the path to standardize procedures to produce gender-balanced datasets.
