CL Jun 14, 2021

Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus

arXiv:2106.07241v11090 citations
Originality: Synthesis-oriented
AI Analysis

This corpus is a valuable resource for NLP researchers and developers working on Amharic language processing, though the contribution is incremental because it builds on existing tools and methods.

The authors tackled the lack of a large, tagged corpus for Amharic by creating a contemporary corpus with 24 million words from 25,199 documents, automatically tagged for morpho-syntactic information using a modified version of HornMorpho.

We introduce the Contemporary Amharic Corpus, which is automatically tagged for morpho-syntactic information. Texts were collected from 25,199 documents from different domains, and about 24 million orthographic words were tokenized. Since it is partly a web corpus, we applied automatic spelling error correction. We also modified the existing morphological analyzer, HornMorpho, to use it for the automatic tagging.
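To make the notion of counting "orthographic words" concrete, here is a minimal Python sketch of one way such tokens could be extracted from Amharic text. It is not the authors' tokenizer: the helper names `tokenize` and `count_words` are hypothetical, and the rule that a word is any run of characters delimited by whitespace, Ethiopic punctuation (U+1361 to U+1368), or common ASCII punctuation is an assumption made for illustration.

```python
import re

# Assumption: word boundaries are whitespace, Ethiopic punctuation
# (U+1361 wordspace through U+1368 paragraph separator), and common
# ASCII punctuation. The paper's actual tokenization rules may differ.
PUNCT = "\u1361-\u1368" + re.escape(".,;:!?()[]\"'")
TOKEN_RE = re.compile("[^\\s" + PUNCT + "]+")


def tokenize(text):
    """Return the orthographic word tokens found in `text` (hypothetical helper)."""
    return TOKEN_RE.findall(text)


def count_words(documents):
    """Total number of orthographic word tokens across an iterable of documents."""
    return sum(len(tokenize(doc)) for doc in documents)


if __name__ == "__main__":
    docs = ["ሰላም ዓለም።", "አንድ፡ሁለት፣ ሦስት"]
    for doc in docs:
        print(tokenize(doc))
    print(count_words(docs))
```

A corpus-level word count like the one reported above would come from running a counter of this kind over every collected document. Spelling correction and morphological tagging would be separate, later steps.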

