CL | May 28, 2025

Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages

arXiv:2505.22273v1 | 6.72 citations | h-index: 7 | EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses lexical normalization for unsegmented languages. It provides a comprehensive evaluation, and its contribution is incremental in nature.

The paper tackles lexical normalization for unsegmented languages by creating a large-scale, multi-domain Japanese dataset and developing methods based on pretrained models. Both encoder-only and decoder-only approaches achieve promising results in accuracy and efficiency.

Lexical normalization research has sought to tackle the challenge of processing informal expressions in user-generated text, yet the absence of comprehensive evaluations leaves it unclear which methods excel across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) creating a large-scale, multi-domain Japanese normalization dataset, (2) developing normalization methods based on state-of-the-art pretrained models, and (3) conducting experiments across multiple evaluation perspectives. Our experiments show that both encoder-only and decoder-only approaches achieve promising results in both accuracy and efficiency.
