IR, DL | Sep 27, 2017

Scaling Author Name Disambiguation with CNF Blocking

arXiv:1709.09657v1 | 12 citations
Originality: Incremental advance
AI Analysis

This work addresses the computational bottleneck that researchers and institutions face when disambiguating authors in large scholarly databases. It is an incremental advance because it builds on existing blocking techniques.

The paper tackles the scalability issue in author name disambiguation by introducing a CNF-based blocking method. Tested on the entire PubMed database of 80 million author mentions, the method removes 82.17% of all author record pairs in 10 minutes while maintaining high pairs completeness.

An author name disambiguation (AND) algorithm identifies a unique author entity record from all similar or same publication records in scholarly or similar databases. Typically, a clustering method is used that requires calculating the similarity between each possible record pair. However, the total number of pairs grows quadratically with the size of the author database, making such clustering difficult for millions of records. One remedy is a blocking function that reduces the number of pairwise similarity calculations. Here, we introduce a new way of learning blocking schemes that uses a conjunctive normal form (CNF) rather than a disjunctive normal form (DNF). We demonstrate on PubMed author records that CNF blocking removes more pairs than previous DNF-based methods while preserving high pairs completeness, and does so with significantly reduced computation time. Thus, these concepts in scholarly data can be better represented with CNFs. Moreover, we show how to ensure that the method produces disjoint blocks so that the rest of the AND algorithm can be easily parallelized. Our CNF blocking, tested on the entire PubMed database of 80 million author mentions, efficiently removes 82.17% of all author record pairs in 10 minutes.
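
To make the terms in the abstract concrete, the following is a minimal sketch of how a CNF blocking scheme filters record pairs and how the two quantities it trades off (the fraction of pairs removed and pairs completeness) can be measured. It is not the paper's learning algorithm: the toy records, the field names, the three predicates, and the hand-written scheme are all assumptions made for illustration, and the paper learns its scheme from data rather than fixing it by hand.

```python
from itertools import combinations

# Toy author mentions. Fields and values are hypothetical; "author" is the
# ground-truth identity used only to measure pairs completeness.
records = [
    {"id": 0, "first": "john", "last": "smith", "author": 0},
    {"id": 1, "first": "j",    "last": "smith", "author": 0},
    {"id": 2, "first": "jane", "last": "smith", "author": 1},
    {"id": 3, "first": "john", "last": "smyth", "author": 2},
    {"id": 4, "first": "wei",  "last": "wang",  "author": 3},
    {"id": 5, "first": "w",    "last": "wang",  "author": 3},
]

# Hypothetical blocking predicates: each compares two records and returns a bool.
def same_last(a, b):
    return a["last"] == b["last"]

def same_first(a, b):
    return a["first"] == b["first"]

def same_first_initial(a, b):
    return a["first"][0] == b["first"][0]

# A CNF scheme is an AND of clauses, where each clause is an OR of predicates.
# This one reads: same_last AND (same_first OR same_first_initial).
cnf_scheme = [[same_last], [same_first, same_first_initial]]

def cnf_keeps(scheme, a, b):
    """Return True if the pair satisfies every clause of the CNF scheme."""
    return all(any(pred(a, b) for pred in clause) for clause in scheme)

# n records give n * (n - 1) / 2 candidate pairs, hence the quadratic growth.
all_pairs = list(combinations(records, 2))
kept = [(a, b) for a, b in all_pairs if cnf_keeps(cnf_scheme, a, b)]

true_pairs = [(a, b) for a, b in all_pairs if a["author"] == b["author"]]
kept_true = [(a, b) for a, b in kept if a["author"] == b["author"]]

# Fraction of candidate pairs removed by blocking.
reduction_ratio = 1 - len(kept) / len(all_pairs)
# Fraction of true same-author pairs that survive blocking.
pairs_completeness = len(kept_true) / len(true_pairs)

print(f"pairs: {len(all_pairs)}, kept: {len(kept)}")
print(f"reduction ratio: {reduction_ratio:.4f}")
print(f"pairs completeness: {pairs_completeness:.4f}")
```

On this toy data the scheme keeps 4 of the 15 candidate pairs and retains both true same-author pairs. The sketch enumerates every pair only so the metrics are easy to read; doing so is itself quadratic, so a scalable implementation would instead group records by blocking keys and never materialize the full pair list.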

