CLMay 28, 2025

Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages

arXiv:2505.22273v12 citationsh-index: 7EMNLP
Originality Synthesis-oriented
AI Analysis

This work addresses lexical normalization for unsegmented languages, providing a comprehensive evaluation that is incremental in nature.

The paper tackled the problem of lexical normalization for unsegmented languages by creating a large-scale Japanese dataset and developing methods based on pretrained models, achieving promising results in accuracy and efficiency.

Lexical normalization research has sought to tackle the challenge of processing informal expressions in user-generated text, yet the absence of comprehensive evaluations leaves it unclear which methods excel across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) creating a large-scale, multi-domain Japanese normalization dataset, (2) developing normalization methods based on state-of-the-art pretrained models, and (3) conducting experiments across multiple evaluation perspectives. Our experiments show that both encoder-only and decoder-only approaches achieve promising results in both accuracy and efficiency.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes