CL · Jun 25, 2021

Manually Annotated Spelling Error Corpus for Amharic

Andargachew Mekonnen Gezmu, Tirufat Tesifaye Lema, Binyam Ephrem Seyoum, Andreas Nürnberger

arXiv:2106.13521v1 · 0.75 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

The corpus addresses a problem for NLP researchers and developers working on Amharic language processing by providing a foundational dataset, though the contribution is incremental because it builds on existing corpus creation methods.

The paper tackles the lack of resources for spelling error detection and correction in Amharic by creating a manually annotated corpus with 1,000 sentences and 5,000 tokens, tagged for non-word and real-word errors, which enables evaluation and handling of both error types.

This paper presents a manually annotated spelling error corpus for Amharic, the lingua franca of Ethiopia. The corpus is designed to be used for the evaluation of spelling error detection and correction. The misspellings are tagged as non-word and real-word errors. In addition, the contextual information available in the corpus makes it useful in dealing with both types of spelling errors.
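
The distinction between the two error tags can be illustrated with a minimal sketch. It assumes a toy English lexicon and invented word pairs rather than the Amharic corpus itself, and `classify_error` is a hypothetical helper, not part of the authors' code. A misspelling that is absent from the lexicon is a non-word error; a misspelling that happens to be another valid word is a real-word error, which is why context is needed to catch it.

```python
def classify_error(written, intended, lexicon):
    """Label a written token relative to the word the writer intended.

    Hypothetical helper for illustration only; the lexicon and the
    word pairs below are toy English data, not taken from the corpus.
    """
    if written == intended:
        return "correct"
    # The written form is not a valid word at all: a dictionary
    # lookup alone is enough to detect it.
    if written not in lexicon:
        return "non-word"
    # The written form is a valid word, just not the intended one:
    # only the surrounding context can reveal the error.
    return "real-word"


lexicon = {"their", "there", "form", "from", "house"}
examples = [("thier", "their"), ("form", "from"), ("house", "house")]
labels = [classify_error(written, intended, lexicon) for written, intended in examples]
print(labels)
```

In this sketch, "thier" is flagged by the lexicon lookup, while "form" written for "from" passes the lookup and is only identifiable as an error because the intended word is known.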
