MULTEXT-East
MULTEXT-East provides a standardized, freely available dataset for researchers in computational linguistics and natural language processing. The contribution is incremental, since it builds on existing resources and covers a specific set of languages.
The paper introduces the MULTEXT-East dataset, a multilingual resource for language engineering that covers 16 languages and comprises morphosyntactic specifications, lexicons, and an annotated, sentence-aligned parallel corpus of George Orwell's "1984", with hand-validated annotations and uniform XML encoding.
The MULTEXT-East language resources are a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The dataset includes the EAGLES-based morphosyntactic specifications, morphosyntactic lexicons, and annotated multilingual corpora. The parallel corpus, the novel "1984" by George Orwell, is sentence-aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, following the Text Encoding Initiative Guidelines (TEI P5), and cover 16 languages: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. The dataset is extensively documented and freely available for research purposes. This case study gives a history of the development of the MULTEXT-East resources, presents their encoding and components, discusses related work, and gives some conclusions.
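
To make the encoding more concrete, the following is a minimal sketch of how a token-level annotation of this kind might be read with Python's standard library. It assumes, without confirmation from the text above, that each word is a TEI `<w>` element carrying its lemma in a `lemma` attribute and a pointer to its morphosyntactic description in an `ana` attribute. The sample sentence, the tag strings, and the function name `read_tokens` are invented for illustration and are not taken from the corpus or its specifications.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment written for this example; the attribute layout
# (lemma, ana) and the tag strings are assumptions, not corpus data.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0" xml:id="demo.1">
  <w lemma="it" ana="#Pp3ns">It</w>
  <w lemma="be" ana="#Vmis3s">was</w>
  <w lemma="cold" ana="#Afp">cold</w>
  <c>.</c>
</s>"""


def read_tokens(xml_text):
    """Return (form, lemma, msd) triples for every <w> element.

    Punctuation (<c>) is skipped because it carries no lemma or tag
    in this illustrative layout.
    """
    root = ET.fromstring(xml_text)
    triples = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        # The ana attribute is a pointer such as "#Afp"; drop the "#".
        msd = w.get("ana", "").lstrip("#")
        triples.append((w.text, w.get("lemma"), msd))
    return triples


tokens = read_tokens(SAMPLE)
for form, lemma, msd in tokens:
    print(form, lemma, msd, sep="\t")
```

Before relying on a reader like this, check the element and attribute names against the dataset's own documentation, since they may differ between releases.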