Announcing CzEng 2.0 Parallel Corpus with over 2 Gigawords
CzEng 2.0 provides a large-scale, cleaned dataset for machine translation and NLP research involving Czech and English, though as an update to an existing corpus its contribution is incremental.
The authors introduce CzEng 2.0, a Czech-English parallel corpus with over 2 billion words per language. It carries document-level information, reduces noise through several filtering techniques, and adds new authentic and synthetic data to the contents of the previous CzEng release.
We present a new release of the Czech-English parallel corpus CzEng 2.0 consisting of over 2 billion words (2 "gigawords") in each language. The corpus contains document-level information and is filtered with several techniques to lower the amount of noise. In addition to the data in the previous version of CzEng, it contains new authentic parallel data as well as high-quality synthetic parallel data. CzEng is freely available for research and educational purposes.