Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages
This work addresses lexical normalization for unsegmented languages and offers a comprehensive, though incremental, evaluation.
The paper tackles lexical normalization for unsegmented languages by creating a large-scale Japanese dataset and developing methods based on pretrained models, which achieve promising results in accuracy and efficiency.
Lexical normalization research has sought to tackle the challenge of processing informal expressions in user-generated text, yet the absence of comprehensive evaluations leaves it unclear which methods excel across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) creating a large-scale, multi-domain Japanese normalization dataset, (2) developing normalization methods based on state-of-the-art pretrained models, and (3) conducting experiments across multiple evaluation perspectives. Our experiments show that encoder-only and decoder-only approaches both achieve promising results in accuracy and efficiency.