CL · Oct 1, 2020

Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT

arXiv:2010.00287v2993 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses a practical problem in Persian text processing, where users and applications must handle noisy or incorrectly formatted text. The contribution is a domain-specific improvement.

The paper tackles Persian word segmentation correction and zero-width non-joiner (ZWNJ) recognition jointly by framing them as a sequence labeling task, achieving a macro-averaged F1-score of 92.40% on a challenging corpus of 500 sentences.

Words are properly segmented in the Persian writing system; in practice, however, these writing rules are often neglected, resulting in single words being written disjointedly and multiple words written without any white spaces between them. This paper addresses the problems of word segmentation and zero-width non-joiner (ZWNJ) recognition in Persian, which we approach jointly as a sequence labeling problem. We achieved a macro-averaged F1-score of 92.40% on a carefully collected corpus of 500 sentences with a high level of difficulty.
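To make the sequence labeling framing concrete, here is a minimal sketch of one way to turn segmentation and ZWNJ recognition into per-character labels. The label names (`O`, `SPACE`, `ZWNJ`), the per-character granularity, and the `encode`/`decode` helpers are assumptions for illustration; the abstract does not specify the paper's actual tagging scheme or how labels are fed to BERT.

```python
# Illustrative only: a hypothetical per-character labeling scheme.
# Each character is tagged with the boundary that should FOLLOW it:
#   "O"     -> nothing follows (the next character belongs to the same unit)
#   "SPACE" -> a white space follows
#   "ZWNJ"  -> a zero-width non-joiner (U+200C) follows

ZWNJ = "\u200c"
SPACE = " "


def encode(text):
    """Split correctly written text into characters and boundary labels."""
    chars, labels = [], []
    for ch in text:
        if ch in (SPACE, ZWNJ):
            # Attach the boundary to the preceding character, if any.
            if labels:
                labels[-1] = "SPACE" if ch == SPACE else "ZWNJ"
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


def decode(chars, labels):
    """Rebuild text from characters and (gold or predicted) labels."""
    out = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label == "SPACE":
            out.append(SPACE)
        elif label == "ZWNJ":
            out.append(ZWNJ)
    return "".join(out)


# "mi-ravam khaneh": the prefix "mi" is joined to "ravam" by a ZWNJ,
# and a space separates the verb from the following word.
text = "\u0645\u06cc\u200c\u0631\u0648\u0645 \u062e\u0627\u0646\u0647"
chars, labels = encode(text)
restored = decode(chars, labels)
```

Under this scheme, a noisy input would first be stripped of its spaces and ZWNJs, a tagger would predict one label per remaining character, and `decode` would rebuild the corrected text. Because both boundary types share one label set, a single model handles both problems at once, which is the sense in which the task is joint.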

