CL, Jul 6, 2020

Announcing CzEng 2.0 Parallel Corpus with over 2 Gigawords

arXiv:2007.03006v1, 39 citations
AI Analysis

CzEng 2.0 provides a large-scale, cleaned dataset for machine translation and NLP research involving Czech and English, though as an update to an existing corpus its contribution is incremental.

The authors introduce CzEng 2.0, a Czech-English parallel corpus with over 2 billion words in each language. It carries document-level information, is filtered with several techniques to reduce noise, and adds new authentic and synthetic data to what the previous release contained.

We present a new release of the Czech-English parallel corpus CzEng 2.0 consisting of over 2 billion words (2 "gigawords") in each language. The corpus contains document-level information and is filtered with several techniques to lower the amount of noise. In addition to the data in the previous version of CzEng, it contains new authentic and also high-quality synthetic parallel data. CzEng is freely available for research and educational purposes.
