Bianet: A Parallel News Corpus in Turkish, Kurdish and English
The corpus is a useful resource for multilingual NLP research, particularly for low-resource languages such as Kurdish, but the contribution is incremental: it focuses on data collection rather than novel methods.
The authors address the lack of parallel news data for Turkish, Kurdish, and English by creating Bianet, an open-source corpus, and validate it by evaluating neural machine translation models, reporting improvements in translation tasks.
We present a new open-source parallel corpus consisting of news articles collected from the Bianet magazine, an online newspaper that publishes Turkish news, often along with their translations in English and Kurdish. In this paper, we describe the collection process of the corpus and its statistical properties. We validate the benefit of using the Bianet corpus by evaluating bilingual and multilingual neural machine translation models in English-Turkish and English-Kurdish directions.
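As a rough illustration of what the "statistical properties" of a parallel corpus usually cover, the sketch below computes sentence-pair counts, token counts, vocabulary sizes, and average sentence lengths for a sentence-aligned corpus. It is a minimal sketch under stated assumptions, not the authors' tooling: the function name `corpus_stats`, whitespace tokenization, and the two toy English-Turkish pairs are all hypothetical and are not taken from the Bianet corpus.

```python
from statistics import mean


def corpus_stats(pairs):
    """Summarize a sentence-aligned parallel corpus.

    `pairs` is a list of (source_sentence, target_sentence) tuples.
    Tokens are obtained by whitespace splitting, which is an assumption;
    a real pipeline would likely use a proper tokenizer.
    """
    src_tokens = [src.split() for src, _ in pairs]
    tgt_tokens = [tgt.split() for _, tgt in pairs]

    return {
        "sentence_pairs": len(pairs),
        "src_tokens": sum(len(toks) for toks in src_tokens),
        "tgt_tokens": sum(len(toks) for toks in tgt_tokens),
        # Vocabulary size = number of distinct token types per side.
        "src_types": len({tok for toks in src_tokens for tok in toks}),
        "tgt_types": len({tok for toks in tgt_tokens for tok in toks}),
        "avg_src_len": mean(len(toks) for toks in src_tokens) if pairs else 0.0,
        "avg_tgt_len": mean(len(toks) for toks in tgt_tokens) if pairs else 0.0,
    }


# Toy English-Turkish pairs, invented for illustration only.
toy_pairs = [
    ("the court postponed the hearing", "mahkeme duruşmayı erteledi"),
    ("the hearing is tomorrow", "duruşma yarın"),
]

stats = corpus_stats(toy_pairs)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Running the same function once per language pair (for example English-Turkish and English-Kurdish) would give per-direction figures that can be compared side by side.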