CL · Nov 16, 2020

Evaluating Sentence Segmentation and Word Tokenization Systems on Estonian Web Texts

arXiv:2011.07868v1
AI Analysis

This paper addresses the challenge of processing noisy web texts for Estonian language applications, but its contribution is incremental: it evaluates existing systems on new data without introducing new methods.

The paper tackles sentence segmentation and word tokenization on noisy Estonian web texts. It finds that EstNLTK achieves the highest sentence segmentation performance of the three systems evaluated, while Stanza and UDPipe score well below their results on the more well-formed Estonian UD test set.

Texts obtained from the web are noisy and do not necessarily follow the orthographic sentence and word boundary rules. Thus, sentence segmentation and word tokenization systems that have been developed on well-formed texts might not perform as well on unedited web texts. In this paper, we first describe the manual annotation of sentence boundaries of an Estonian web dataset and then present the evaluation results of three existing sentence segmentation and word tokenization systems on this corpus: EstNLTK, Stanza and UDPipe. While EstNLTK obtains the highest performance compared to the other systems on sentence segmentation on this dataset, the sentence segmentation performance of Stanza and UDPipe remains well below the results obtained on the more well-formed Estonian UD test set.
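To make the evaluation setup concrete, here is a minimal sketch of how sentence segmentation output might be scored against manually annotated boundaries. It assumes an exact span-matching scheme (a predicted sentence counts as correct only if its start and end character offsets both match a gold sentence), which is a common convention but is not confirmed by this summary as the paper's metric. The function name `segmentation_prf` and the example offsets are hypothetical.

```python
from typing import List, Tuple

# One sentence as (start, end) character offsets in the raw text.
Span = Tuple[int, int]


def segmentation_prf(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exactly matching sentence spans.

    Assumption: exact span match; the paper's own metric may differ.
    """
    gold_set, pred_set = set(gold), set(pred)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: the gold annotation has three sentences,
# and the system wrongly merges the last two into one.
gold = [(0, 10), (11, 25), (26, 40)]
pred = [(0, 10), (11, 40)]
p, r, f = segmentation_prf(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under this scheme a single missed boundary costs two gold sentences, which is one reason segmentation scores can drop sharply on web texts with missing punctuation.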

Code Implementations: 1 repo