CL · Mar 7, 2017

Building a Syllable Database to Solve the Problem of Khmer Word Segmentation

arXiv:1703.02166v1 · 3 citations

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of word segmentation for Khmer language processing, particularly for Khmer as used in Southern Vietnam. It is incremental: it builds on existing research with a new database approach.

The paper tackles the problem of Khmer word segmentation, which is challenging due to the language's complex writing system and ambiguous phenomena, by building a syllable database using syllable models and lexical data, achieving high accuracy in tests.

Word segmentation is a basic problem in natural language processing. For languages with a complex writing system, such as the Khmer language in Southern Vietnam, the problem is especially intractable and poses significant challenges. Although experts in Vietnam and abroad have studied this problem in depth, no results so far adequately meet the demand; in particular, none has thoroughly treated ambiguity in Khmer language processing. This paper presents a solution based on dividing syllables into component clusters using two proposed syllable models, and thereby builds a Khmer syllable database, a resource that is not yet actually available. The method uses a lexical database, updated from online Khmer dictionaries and some supporting dictionaries, that serves as training data and supplies complementary linguistic characteristics. Each component cluster is labelled and located by its first and last letter so that a whole syllable can be identified. The approach is workable, and the test results achieve high accuracy, eliminate ambiguity, and contribute to solving the word segmentation problem and to efficient Khmer language processing.
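The abstract does not give the algorithm in detail, so the following is only a minimal sketch of how a syllable database keyed by first and last letter might be used for segmentation. It swaps in a plain greedy longest-match lookup, which is not claimed to be the paper's method. The names `build_index`, `segment`, and `TOY_SYLLABLES` are hypothetical, and the entries are Latin placeholder strings rather than real Khmer data.

```python
from collections import defaultdict

def build_index(syllables):
    """Group known syllables by (first letter, last letter).

    Assumption: this mirrors the abstract's idea of locating a unit
    by its first and last letter; the paper's real structure may differ.
    """
    index = defaultdict(set)
    for s in syllables:
        index[(s[0], s[-1])].add(s)
    return index

def segment(text, index, max_len):
    """Greedy longest-match split of text into known syllables."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if cand in index.get((cand[0], cand[-1]), ()):
                out.append(cand)
                i += n
                break
        else:
            # Unknown character: emit it alone so the loop always advances.
            out.append(text[i])
            i += 1
    return out

# Hypothetical toy inventory (placeholder strings, not Khmer script).
TOY_SYLLABLES = ["khnhom", "tov", "sa", "la", "sala"]
index = build_index(TOY_SYLLABLES)
max_len = max(len(s) for s in TOY_SYLLABLES)
result = segment("khnhomtovsala", index, max_len)
print(result)
```

Greedy matching prefers the longer entry `sala` over `sa` followed by `la`, which is exactly the kind of ambiguity a real system has to resolve with additional linguistic information. Real Khmer text would also need to be iterated by character cluster (base consonant plus subscripts and vowel signs) rather than by single code point.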
