CL · Mar 27, 2015

Normalization of Non-Standard Words in Croatian Texts

Slobodan Beliga, Miran Pobar, Sanda Martinčić-Ipšić

arXiv:1503.08167v22.22 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses text normalization for Croatian language processing. The contribution is incremental: it applies existing rule-based and dictionary methods to a specific domain.

The paper tackles the problem of normalizing non-standard words in Croatian texts for text-to-speech synthesis, achieving a 95% token normalization rate, with 80% of expanded words in the correct morphological form.

This paper presents text normalization, an integral part of any text-to-speech synthesis system. Text normalization is a set of methods whose task is to write non-standard words, such as numbers, dates, times, abbreviations, acronyms and the most common symbols, in their full expanded form. A complete taxonomy for classifying non-standard words in Croatian is proposed, together with rule-based normalization methods combined with a lookup dictionary. The achieved token rate for normalization of Croatian texts is 95%, and 80% of expanded words are in the correct morphological form.
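
To make the idea of rule-based normalization combined with a lookup dictionary concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names (`classify`, `expand_number`, `normalize`), the three-class token taxonomy, and the dictionary entries are all assumptions chosen for illustration. It handles only whitespace-separated tokens and cardinal numbers from 0 to 99, and it always emits the base (nominative) form, so it makes no attempt at the morphological agreement that the abstract reports separately.

```python
import re

# Hypothetical lookup dictionary; these entries are illustrative, not from the paper.
ABBREVIATIONS = {
    "npr.": "na primjer",
    "itd.": "i tako dalje",
    "tj.": "to jest",
    "dr.": "doktor",
}
SYMBOLS = {"%": "posto"}

# Croatian cardinal numbers in their base form.
UNITS = [
    "nula", "jedan", "dva", "tri", "četiri", "pet", "šest", "sedam",
    "osam", "devet", "deset", "jedanaest", "dvanaest", "trinaest",
    "četrnaest", "petnaest", "šesnaest", "sedamnaest", "osamnaest",
    "devetnaest",
]
TENS = {
    2: "dvadeset", 3: "trideset", 4: "četrdeset", 5: "pedeset",
    6: "šezdeset", 7: "sedamdeset", 8: "osamdeset", 9: "devedeset",
}


def expand_number(n: int) -> str:
    """Rule-based expansion of a cardinal number in the range 0..99."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] if unit == 0 else f"{TENS[tens]} {UNITS[unit]}"


def classify(token: str) -> str:
    """Assign a token to one class of a (simplified, assumed) taxonomy."""
    if token.lower() in ABBREVIATIONS:
        return "abbreviation"
    if token in SYMBOLS:
        return "symbol"
    if re.fullmatch(r"\d{1,2}", token):
        return "number"
    return "standard"


def normalize(text: str) -> str:
    """Expand non-standard tokens; standard words pass through unchanged."""
    out = []
    for token in text.split():
        kind = classify(token)
        if kind == "abbreviation":
            out.append(ABBREVIATIONS[token.lower()])
        elif kind == "symbol":
            out.append(SYMBOLS[token])
        elif kind == "number":
            out.append(expand_number(int(token)))
        else:
            out.append(token)
    return " ".join(out)


print(normalize("Popust je 25 % npr. danas"))
```

A fuller system would need rules for dates, times, and acronyms, and would have to inflect the expanded words for case, gender, and number according to context.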

