CL, Jun 14, 2021

Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus

Andargachew Mekonnen Gezmu, Binyam Ephrem Seyoum, Michael Gasser, Andreas Nürnberger

arXiv:2106.07241v1

Originality: Synthesis-oriented

AI Analysis

The corpus is a valuable resource for NLP researchers and developers working on Amharic language processing, though the contribution is incremental because it builds on existing tools and methods.

The authors address the lack of a large tagged corpus for Amharic by building a contemporary corpus of about 24 million orthographic words drawn from 25,199 documents, automatically tagged for morpho-syntactic information with a modified version of HornMorpho.

We introduce the Contemporary Amharic Corpus, which is automatically tagged for morpho-syntactic information. Texts were collected from 25,199 documents across different domains and tokenized into about 24 million orthographic words. Since it is partly a web corpus, we applied automatic spelling error correction. We also modified the existing morphological analyzer, HornMorpho, to use it for the automatic tagging.
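To make the notion of counting orthographic words concrete, here is a minimal Python sketch. It is not the authors' tokenizer: it assumes that an orthographic word is any run of characters delimited by whitespace, Ethiopic punctuation (U+1360 to U+1368, which includes the wordspace ፡ and the full stop ።), or a few ASCII punctuation marks. The names `tokenize` and `count_words` are hypothetical.

```python
import re

# Assumption: Ethiopic punctuation occupies U+1360..U+1368
# (section mark, wordspace, full stop, comma, semicolon, colon, ...).
ETHIOPIC_PUNCT = "\u1360-\u1368"

# Split on whitespace, Ethiopic punctuation, and a few ASCII punctuation marks.
_SPLIT = re.compile(r"[\s.,;:!?()" + ETHIOPIC_PUNCT + r"]+")


def tokenize(text):
    """Return the orthographic words of `text`, dropping empty pieces."""
    return [tok for tok in _SPLIT.split(text) if tok]


def count_words(documents):
    """Total number of orthographic words across an iterable of documents."""
    return sum(len(tokenize(doc)) for doc in documents)
```

Applied to a real collection, `count_words` would take the raw text of each document; a production pipeline would likely also need normalization and spelling correction before counting, as the abstract notes.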
