CL, Mar 31

CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking

Shohei Higashiyama, Masao Ideuchi, Masao Utiyama

arXiv:2603.29336, 28.9, h-index: 7

AI Analysis

CADEL provides a new benchmark for Japanese entity linking systems and addresses a gap in language resources, though the contribution is incremental: it focuses on data creation rather than method innovation.

The authors tackled the lack of Japanese resources for entity linking by constructing CADEL, an annotated corpus with rich coverage of Japan-specific entities. The annotations show high inter-annotator agreement, and the corpus contains many non-trivial cases for evaluation.

Entity linking is the task of associating linguistic expressions with entries in a knowledge base that represent real-world entities and concepts. Language resources for this task have primarily been developed for English, and the resources available for evaluating Japanese systems remain limited. In this study, we develop a corpus design policy for the entity linking task and construct an annotated corpus for training and evaluating Japanese entity linking systems, with rich coverage of linguistic expressions referring to entities that are specific to Japan. Evaluation of inter-annotator agreement confirms the high consistency of the annotations in the corpus, and a preliminary experiment on entity disambiguation based on string matching suggests that the corpus contains a substantial number of non-trivial cases, supporting its potential usefulness as an evaluation benchmark.
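To make the string-matching idea in the abstract concrete, here is a minimal sketch of an exact-match disambiguation baseline. It is not the authors' implementation: the toy knowledge base, the entry IDs (`E1` to `E3`), and the function names are all hypothetical, and the abstract does not specify how the actual matching was done. The sketch only illustrates why ambiguous or unlisted surface forms count as non-trivial cases for such a baseline.

```python
from collections import defaultdict


def build_alias_index(kb):
    """Map each surface form (label or alias) to the set of entry IDs that carry it."""
    index = defaultdict(set)
    for entry_id, names in kb.items():
        for name in names:
            index[name].add(entry_id)
    return index


def link_by_exact_match(mention, index):
    """Return an entry ID only when the mention matches exactly one entry, else None."""
    candidates = index.get(mention, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # no candidate, or several candidates: string matching cannot decide


# Hypothetical toy knowledge base (assumption, not taken from CADEL).
toy_kb = {
    "E1": ["東京都", "東京"],
    "E2": ["東京駅", "東京"],
    "E3": ["総務省"],
}

index = build_alias_index(toy_kb)
mentions = ["総務省", "東京", "都庁"]
results = {m: link_by_exact_match(m, index) for m in mentions}

for mention, entry_id in results.items():
    print(mention, "->", entry_id)
```

In this toy setup, only the unambiguous mention is linked. The mention shared by two entries and the mention absent from the knowledge base both come back unresolved, which is the kind of case a string-matching baseline cannot handle.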
