On the Definition of Japanese Word
This work tackles a foundational linguistic annotation problem for Japanese NLP researchers, but it is incremental: it builds on existing non-mainstream definitions and presents no new empirical results.
The paper addresses the unclear definition of syntactic words in Japanese for Universal Dependencies annotation, arguing that the current Short Unit Words do not meet the guidelines and exploring the feasibility of applying alternative linguistic definitions to corpus annotation.
The annotation guidelines for Universal Dependencies (UD) stipulate that the basic units of dependency annotation are syntactic words, but it is not clear what counts as a syntactic word in Japanese. Departing from the long tradition of using phrasal units called bunsetsu for dependency parsing, the current UD Japanese treebanks adopt the Short Unit Words. However, we argue that they are not syntactic words as specified by the annotation guidelines. Although we find non-mainstream attempts to linguistically define Japanese words, such definitions have never been applied to corpus annotation. We discuss the costs and benefits of adopting these rather unfamiliar criteria.