CL · Jun 25, 2021

Manually Annotated Spelling Error Corpus for Amharic

arXiv:2106.13521v1 · 5 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses a resource gap for NLP researchers and developers working on Amharic by providing a foundational dataset. The contribution is incremental, since it builds on existing corpus creation methods.

The paper tackles the lack of resources for spelling error detection and correction in Amharic by creating a manually annotated corpus of 1,000 sentences and 5,000 tokens. Misspellings are tagged as non-word or real-word errors, so the corpus supports evaluating and handling both error types.

This paper presents a manually annotated spelling error corpus for Amharic, the lingua franca of Ethiopia. The corpus is designed for evaluating spelling error detection and correction. Misspellings are tagged as non-word or real-word errors. In addition, the contextual information available in the corpus makes it useful for dealing with both types of spelling errors.
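To illustrate how a corpus tagged this way could be used for evaluation, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Token` structure, the label names `"ok"`, `"non-word"`, and `"real-word"`, the toy lexicon, and the English placeholder tokens do not come from the paper or its repository. The detector is a plain lexicon lookup, chosen because it shows why real-word errors need context: a misspelling that happens to be a valid word is never flagged.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Token:
    surface: str  # token as written in the sentence
    label: str    # hypothetical labels: "ok", "non-word", "real-word"


def lexicon_detector(surface: str, lexicon: Set[str]) -> bool:
    """Flag a token as misspelled if it is not in the lexicon."""
    return surface not in lexicon


def evaluate(corpus: List[List[Token]], lexicon: Set[str]) -> Dict:
    """Score the lexicon detector against the gold error labels."""
    tp = fp = fn = 0
    caught = {"non-word": 0, "real-word": 0}
    total = {"non-word": 0, "real-word": 0}
    for sentence in corpus:
        for tok in sentence:
            flagged = lexicon_detector(tok.surface, lexicon)
            is_error = tok.label != "ok"
            if is_error:
                total[tok.label] += 1
            if flagged and is_error:
                tp += 1
                caught[tok.label] += 1
            elif flagged:
                fp += 1
            elif is_error:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    recall_by_type = {
        k: (caught[k] / total[k] if total[k] else 0.0) for k in total
    }
    return {
        "precision": precision,
        "recall": recall,
        "recall_by_type": recall_by_type,
    }


# Toy data with English placeholders (not taken from the corpus).
lexicon = {"the", "cat", "sat", "on", "mat", "a", "letter", "form", "from"}
corpus = [
    # "cta" is a non-word error for "cat".
    [Token("the", "ok"), Token("cta", "non-word"), Token("sat", "ok")],
    # "form" is a real-word error for "from": valid word, wrong context.
    [Token("a", "ok"), Token("letter", "ok"), Token("form", "real-word"),
     Token("the", "ok"), Token("cat", "ok")],
]

result = evaluate(corpus, lexicon)
print(result)
```

On this toy data the lookup catches the non-word error but misses the real-word one, which is the gap that sentence-level context in an annotated corpus is meant to help close.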

Code Implementations · 1 repo
