Bridging the Language Gap in Scholarly Data I: Enhancing Author Disambiguation Algorithms for Chinese Names
For researchers conducting large-scale scientometric analyses involving Chinese authors, this work provides a practical, script-agnostic disambiguation method that improves recall over existing approaches.
The paper addresses the challenge of disambiguating Chinese author names in scholarly metadata, which is particularly difficult because Romanized Pinyin is ambiguous. The proposed rule-based framework achieves F1-scores of 0.88 for Pinyin names and 0.89 for Chinese-character names on a human-annotated sample of 80 name pairs, outperforming two baseline approaches.
Disambiguating scholars with identical names is essential for accurate authorship assignment and robust large-scale scientometric research. Existing methods are often designed for Latin-script metadata and perform poorly on Chinese names. In international publications, Chinese names typically appear as Romanized Pinyin, which is highly ambiguous because a single Pinyin spelling can map to multiple distinct characters. Chinese characters, in contrast, reduce but do not eliminate this ambiguity, and they are rarely available in international records. To address both challenges, we propose a rule-based disambiguation framework that integrates co-authorship networks, citation networks, author affiliations, and content similarity. We apply this framework to 65,241 physics papers from the China National Knowledge Infrastructure (CNKI), spanning more than 70 years. On a human-annotated sample of 80 name pairs, our method achieves F1-scores of 0.88 for Pinyin names and 0.89 for character-based names, outperforming two baseline approaches, with improvements driven primarily by higher recall. The comparable performance across both writing systems indicates that our approach is script-agnostic, enabling reliable large-scale scientometric analyses.
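
To make the idea of a rule-based framework concrete, the following is a minimal sketch of how the four evidence types named above (co-authorship, citations, affiliations, content similarity) might be combined to decide whether two same-name records refer to one person. It is not the paper's implementation: the `AuthorRecord` fields, the order of the rules, the use of Jaccard similarity over keywords, and the 0.5 threshold are all assumptions made for illustration, and the example records are invented.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """One author mention on one paper (all fields are hypothetical)."""
    name: str                                         # Pinyin or character form
    coauthors: set = field(default_factory=set)       # names of co-authors
    cited_papers: set = field(default_factory=set)    # IDs of cited papers
    affiliation: str = ""
    keywords: set = field(default_factory=set)        # stand-in for content


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_author(r1, r2, content_threshold=0.5):
    """Return True if any illustrative rule links the two records.

    The rules and the threshold are assumptions, not the paper's rules.
    """
    if r1.name != r2.name:
        return False                      # only same-name pairs are candidates
    if r1.coauthors & r2.coauthors:
        return True                       # shared co-author
    if r1.cited_papers & r2.cited_papers:
        return True                       # shared cited paper
    if r1.affiliation and r1.affiliation == r2.affiliation:
        return True                       # identical, non-empty affiliation
    # Fall back to content similarity over keywords.
    return jaccard(r1.keywords, r2.keywords) >= content_threshold


# Invented records sharing the Pinyin name "Wang Wei".
a = AuthorRecord("Wang Wei", coauthors={"Li Na"},
                 affiliation="University A", keywords={"plasma", "laser"})
b = AuthorRecord("Wang Wei", coauthors={"Li Na", "Zhang Min"},
                 affiliation="University B", keywords={"optics"})
c = AuthorRecord("Wang Wei", coauthors={"Chen Jie"},
                 affiliation="University C", keywords={"superconductivity"})

print(same_author(a, b))  # linked through the shared co-author
print(same_author(a, c))  # no rule fires
```

Because the rules operate on networks and metadata rather than on the name string itself, the same function applies unchanged whether `name` holds Pinyin or Chinese characters, which is the sense in which such a design can be script-agnostic.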