CL · May 24, 2021

Diacritics Restoration using BERT with Analysis on Czech language

Jakub Náplava, Milan Straka, Jana Straková

arXiv:2105.11408v11.413 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses diacritics restoration for multiple languages, with a focus on Czech, but is incremental as it applies an existing method (BERT) to a new task.

The paper proposes a BERT-based architecture for diacritics restoration, evaluates it on 12 languages, and presents a detailed error analysis on Czech, in which roughly 44% of the mispredictions turned out not to be real errors.

We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are actually not errors, but either plausible variants (19%), or the system corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
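To make the task concrete, here is a minimal Python sketch of its input side: removing diacritics from text, which is the mapping a restoration model has to invert. It is an illustration only, not the authors' preprocessing; the function name `strip_diacritics` is hypothetical, and the approach assumes every diacritic decomposes into a base letter plus combining marks under Unicode NFD (true for the Czech letters shown, but not for letters such as "ł" or "đ").

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after NFD decomposition (hypothetical helper).

    Assumption: each diacritic decomposes into base letter + combining mark.
    Letters without a canonical decomposition are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining character (canonical combining class != 0).
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so the result is in NFC form.
    return unicodedata.normalize("NFC", without_marks)


if __name__ == "__main__":
    original = "Příliš žluťoučký kůň"
    print(strip_diacritics(original))  # Prilis zlutoucky kun
```

A restoration system receives the stripped string and must predict the original. Because several diacritized words can map to the same stripped form, some outputs that differ from the reference are still plausible variants, which is the 19% category in the error analysis above.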
