DB, IR | Apr 19, 2021

Large Scale Record Linkage in the Presence of Missing Data

arXiv:2104.09677v1
AI Analysis

The paper addresses the need for accurate data integration in domains such as health analytics and national security, and presents a novel method for a known bottleneck: linking records whose identifying values are erroneous or missing.

The paper tackles low linkage quality caused by errors, variations, and missing quasi-identifying (QID) values. It proposes a technique that combines attribute signatures with relational signatures, and reports high linkage quality on large real-world databases with substantial missing data.
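To make the problem concrete, here is a minimal sketch of the traditional approach the paper contrasts itself with: averaging per-attribute string similarities. The attribute names, the example records, and the choice of `difflib.SequenceMatcher` as the similarity function are illustrative assumptions, not details taken from the paper.

```python
from difflib import SequenceMatcher


def attribute_similarity(a, b):
    """String similarity in [0, 1]; a missing value (None or "") scores 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec_a, rec_b, qid_attributes):
    """Average the per-attribute similarities over all QID attributes."""
    sims = [attribute_similarity(rec_a.get(q), rec_b.get(q)) for q in qid_attributes]
    return sum(sims) / len(sims)


# Hypothetical QID attributes and records, for illustration only.
QIDS = ["first_name", "last_name", "city"]
complete = {"first_name": "Mary", "last_name": "Smith", "city": "Leeds"}
with_typo = {"first_name": "Mary", "last_name": "Smyth", "city": "Leeds"}
with_gaps = {"first_name": "Mary", "last_name": "Smith", "city": None}

score_typo = record_similarity(complete, with_typo, QIDS)
score_gaps = record_similarity(complete, with_gaps, QIDS)

# A single missing value lowers the score more than a one-letter typo does.
print(f"typo: {score_typo:.2f}, missing city: {score_gaps:.2f}")
```

In this sketch a record with one missing attribute scores lower than a record with a spelling variation, even though every value it does contain matches exactly. This is the kind of distortion that motivates a method designed to tolerate missing values.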

Record linkage is aimed at the accurate and efficient identification of records that represent the same entity within or across disparate databases. It is a fundamental task in data integration and increasingly required for accurate decision making in application domains ranging from health analytics to national security. Traditional record linkage techniques calculate string similarities between quasi-identifying (QID) values, such as the names and addresses of people. Errors, variations, and missing QID values can, however, lead to low linkage quality because the similarities between records cannot be calculated accurately. To overcome this challenge, we propose a novel technique that can accurately link records even when QID values contain errors or variations, or are missing. We first generate attribute signatures (concatenated QID values) using an Apriori-based selection of suitable QID attributes, and then relational signatures that encapsulate relationship information between records. Combined, these signatures can uniquely identify individual records and facilitate fast and high-quality linking of very large databases through accurate similarity calculations between records. We evaluate the linkage quality and scalability of our approach using large real-world databases, showing that it can achieve high linkage quality even when the databases being linked contain substantial amounts of missing values and errors.
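The idea of attribute signatures as concatenated QID values can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the attribute combinations are hard-coded here (the paper selects them with an Apriori-based procedure that is not reproduced), relational signatures are omitted, and all names and records are hypothetical.

```python
from collections import defaultdict

# Hypothetical attribute combinations. The paper chooses suitable
# combinations with an Apriori-based selection; these are fixed by hand.
SIGNATURE_COMBINATIONS = [
    ("first_name", "last_name", "birth_year"),
    ("last_name", "street", "city"),
    ("first_name", "birth_year", "city"),
]


def attribute_signatures(record, combinations=SIGNATURE_COMBINATIONS):
    """Build one signature per combination whose values are all present."""
    signatures = set()
    for combo in combinations:
        values = [record.get(attr) for attr in combo]
        if all(values):  # skip any combination that hits a missing value
            normalised = "|".join(str(v).strip().lower() for v in values)
            signatures.add("+".join(combo) + ":" + normalised)
    return signatures


def candidate_pairs(records):
    """Return pairs of record ids that share at least one signature."""
    index = defaultdict(set)
    for rec_id, record in records.items():
        for sig in attribute_signatures(record):
            index[sig].add(rec_id)
    pairs = set()
    for ids in index.values():
        ordered = sorted(ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


# Hypothetical records: "a1" lacks a street, "b1" lacks a last name.
records = {
    "a1": {"first_name": "Mary", "last_name": "Smith", "birth_year": "1980",
           "street": None, "city": "Leeds"},
    "b1": {"first_name": "Mary", "last_name": None, "birth_year": "1980",
           "street": "12 Mill Rd", "city": "Leeds"},
    "b2": {"first_name": "John", "last_name": "Smith", "birth_year": "1975",
           "street": "3 High St", "city": "York"},
}

pairs = candidate_pairs(records)
print(pairs)
```

Because each record can carry several signatures built from different attribute subsets, two records can still be paired when one of them is missing a value, as long as some combination of the remaining attributes is complete in both. Exact signature matching as shown here does not handle spelling variations; the paper's approach additionally relies on similarity calculations and relational signatures for that.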
