CO DB LG ML, Sep 13, 2019

d-blink: Distributed End-to-End Bayesian Entity Resolution

arXiv:1909.06039v3, 29 citations
Originality: Highly original
AI Analysis

This work addresses a critical bottleneck for practitioners in data integration and record linkage by enabling scalable Bayesian methods with rigorous uncertainty quantification, though it is an incremental improvement over existing Bayesian frameworks.

The paper tackles the scalability problem in Bayesian entity resolution, where standard inference algorithms for existing models scale quadratically in the number of records. It proposes d-blink, a distributed model that jointly performs blocking and entity resolution without compromising posterior correctness, and reports that the approach is scalable and effective on six data sets, including a case study on the 2010 Decennial Census.

Entity resolution (ER; also known as record linkage or de-duplication) is the process of merging noisy databases, often in the absence of unique identifiers. A major advancement in ER methodology has been the application of Bayesian generative models, which provide a natural framework for inferring latent entities with rigorous quantification of uncertainty. Despite these advantages, existing models are severely limited in practice, as standard inference algorithms scale quadratically in the number of records. While scaling can be managed by fitting the model on separate blocks of the data, such a naïve approach may induce significant error in the posterior. In this paper, we propose a principled model for scalable Bayesian ER, called "distributed Bayesian linkage" or d-blink, which jointly performs blocking and ER without compromising posterior correctness. Our approach relies on several key ideas, including: (i) an auxiliary variable representation that induces a partition of the entities and records into blocks; (ii) a method for constructing well-balanced blocks based on k-d trees; (iii) a distributed partially-collapsed Gibbs sampler with improved mixing; and (iv) fast algorithms for performing Gibbs updates. Empirical studies on six data sets, including a case study on the 2010 Decennial Census, demonstrate the scalability and effectiveness of our approach.
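
To make the k-d tree idea in (ii) concrete, the following is a minimal sketch of balanced blocking by recursive median splits, together with a count of how many record pairs remain to be compared once the data are blocked. It is an illustration under simplifying assumptions, not the d-blink implementation: the points are numeric tuples, and the names `kd_blocks` and `n_pairs` are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]


def kd_blocks(points: List[Point], depth: int, axis: int = 0) -> List[List[Point]]:
    """Split points at the median, cycling through axes; yields up to 2**depth blocks."""
    if depth == 0 or len(points) <= 1:
        return [list(points)]
    # Sort along the current axis and cut at the median so both halves are balanced.
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    next_axis = (axis + 1) % len(ordered[0])
    return (kd_blocks(ordered[:mid], depth - 1, next_axis)
            + kd_blocks(ordered[mid:], depth - 1, next_axis))


def n_pairs(n: int) -> int:
    """Number of unordered pairs among n items; grows quadratically in n."""
    return n * (n - 1) // 2


# Toy data (an assumption for illustration): 100 points on a 10 x 10 grid.
points = [(float(i % 10), float(i // 10)) for i in range(100)]

blocks = kd_blocks(points, depth=2)
sizes = [len(b) for b in blocks]

all_pairs = n_pairs(len(points))                # pairs without blocking
within_pairs = sum(n_pairs(s) for s in sizes)   # pairs inside blocks only

print(sizes, all_pairs, within_pairs)
```

On this toy input the four blocks are equal in size, and restricting comparisons to within-block pairs cuts the pair count to roughly a quarter. The sketch only shows why balanced blocks help with quadratic scaling; it says nothing about posterior correctness, which the abstract attributes to the auxiliary variable representation in (i).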

Code Implementations: 4 repos