CL, Dec 29, 2023

Normalization of Lithuanian Text Using Regular Expressions

arXiv:2312.17660v2
h-index: 6
Originality: Synthesis-oriented
AI Analysis

This work addresses text normalization for Lithuanian, a domain-specific problem, and is incremental as it applies existing methods to a new language.

The paper tackles text normalization for Lithuanian text-to-speech synthesis by developing a taxonomy of semiotic classes and regular-expression rule sets that detect and expand non-standard words, and it assesses their accuracy in experiments on three datasets.

Text Normalization is an integral part of any text-to-speech synthesis system. In a natural language text, there are elements such as numbers, dates, abbreviations, etc. that belong to other semiotic classes. They are called non-standard words (NSW) and need to be expanded into ordinary words. For this purpose, it is necessary to identify the semiotic class of each NSW. The taxonomy of semiotic classes adapted to the Lithuanian language is presented in the work. Sets of rules are created for detecting and expanding NSWs based on regular expressions. Experiments with three completely different data sets were performed and the accuracy was assessed. Causes of errors are explained and recommendations are given for the development of text normalization rules.
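
The detect-then-expand idea described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's actual taxonomy or rule set: the class names (DATE, TIME, ABBR, CARDINAL), the patterns, and the helper names `detect`, `expand` and `normalize` are hypothetical. Only abbreviations and single digits are expanded, in the nominative form; a real Lithuanian normalizer would also have to handle multi-digit numbers, dates and grammatical inflection.

```python
import re

# Hypothetical semiotic classes and patterns, for illustration only.
# Order matters: earlier alternatives win, so DATE is tried before CARDINAL.
CLASS_PATTERNS = [
    ("DATE", r"\b\d{4}-\d{2}-\d{2}\b"),
    ("TIME", r"\b\d{1,2}:\d{2}\b"),
    ("ABBR", r"\b(?:pvz|Nr)\."),
    ("CARDINAL", r"\b\d+\b"),
]

# One alternation with a named group per class; m.lastgroup names the class.
NSW_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in CLASS_PATTERNS))

# Tiny illustrative lexicons (assumed, nominative forms only).
ABBREVIATIONS = {"pvz.": "pavyzdžiui", "Nr.": "numeris"}
DIGITS = {
    "0": "nulis", "1": "vienas", "2": "du", "3": "trys", "4": "keturi",
    "5": "penki", "6": "šeši", "7": "septyni", "8": "aštuoni", "9": "devyni",
}


def detect(text):
    """Return (semiotic class, token) pairs for every NSW found in text."""
    return [(m.lastgroup, m.group()) for m in NSW_RE.finditer(text)]


def expand(match):
    """Expand one detected NSW; classes without a rule are left unchanged."""
    cls, token = match.lastgroup, match.group()
    if cls == "ABBR":
        return ABBREVIATIONS[token]
    if cls == "CARDINAL" and token in DIGITS:
        return DIGITS[token]
    return token  # DATE, TIME and multi-digit numbers are not expanded here


def normalize(text):
    """Replace each detected NSW with its expansion."""
    return NSW_RE.sub(expand, text)


print(detect("2023-12-29 10:30 Nr. 5"))
# [('DATE', '2023-12-29'), ('TIME', '10:30'), ('ABBR', 'Nr.'), ('CARDINAL', '5')]
print(normalize("pvz. 3 ir 7"))
# pavyzdžiui trys ir septyni
```

Combining the patterns into one alternation keeps the class priority explicit: without the DATE rule placed first, the CARDINAL rule would split a date into three separate numbers.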
