CL · Apr 8, 2021

User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization

arXiv:2104.03523v1726 citations
Originality: Synthesis-oriented
AI Analysis

The corpus provides a challenging benchmark for researchers working on Japanese text processing, but the contribution is incremental: it focuses on dataset creation rather than new methods.

The authors constructed a publicly available Japanese user-generated text corpus of 929 sentences annotated for morphological analysis and lexical normalization to evaluate existing systems, revealing low performance on non-general words and non-standard forms.

Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category information we classified for frequent UGT-specific phenomena. Experiments on the corpus demonstrated the low performance of existing MA/LN methods for non-general words and non-standard forms, indicating that the corpus would be a challenging benchmark for further research on UGT.
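To make the evaluation setting concrete, here is a minimal sketch of how word segmentation output from an MA system might be scored against gold annotations using span-level precision, recall, and F1. This is a common way to score segmentation, not necessarily the authors' exact protocol; the function names `spans` and `segmentation_prf` and the example sentence are hypothetical and do not come from the corpus or its tooling.

```python
from typing import List, Set, Tuple


def spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Convert a token sequence into a set of (start, end) character offsets."""
    out: Set[Tuple[int, int]] = set()
    start = 0
    for tok in tokens:
        end = start + len(tok)
        out.add((start, end))
        start = end
    return out


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Span-level precision, recall, and F1 for one sentence.

    A predicted word counts as correct only if both of its boundaries
    match a gold word exactly.
    """
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and predicted tokens must cover the same text")
    gold_spans, pred_spans = spans(gold), spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative (invented) example: the system over-segments the first word.
gold_tokens = ["すごい", "ねむい"]
pred_tokens = ["すご", "い", "ねむい"]
p, r, f = segmentation_prf(gold_tokens, pred_tokens)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

Scoring on character spans rather than token positions keeps the comparison meaningful when the system and the gold annotation split a sentence into different numbers of words, which is the typical failure mode on non-standard forms.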

Code Implementations: 1 repo