A Hierarchical Graphical Model for Record Linkage
This work addresses the challenge of record linkage in scenarios with limited labeled data, offering an unsupervised approach that is incremental in nature.
The paper tackles unsupervised record linkage, the matching of co-referent records. It proposes a hierarchical graphical model framework that introduces new methods and also casts existing unsupervised probabilistic methods in the same framework. Experimental results show performance competitive with fully supervised methods.
The task of matching co-referent records is known, among other names, as record linkage. For large record-linkage problems, there is often little or no labeled data available, but the unlabeled data shows a reasonably clear structure. For such problems, unsupervised or semi-supervised methods are preferable to supervised methods. In this paper, we describe a hierarchical graphical model framework for the linkage problem in an unsupervised setting. In addition to proposing new methods, we also cast existing unsupervised probabilistic record-linkage methods in this framework. Some of the techniques we propose to minimize overfitting in this model are of interest in the general graphical model setting. We describe a method for incorporating monotonicity constraints in a graphical model. We also outline a bootstrapping approach that uses "single-field" classifiers to noisily label latent variables in a hierarchical model. Experimental results show that our proposed unsupervised methods are competitive even with fully supervised record-linkage methods.
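To make the unsupervised setting concrete, the following is a minimal sketch of one common form of unsupervised probabilistic record linkage: a two-class mixture over binary field-agreement vectors, fitted with EM. It is an illustration only and is not the hierarchical model described in this paper. The function name `fit_match_mixture`, the initial values, and the toy data are all assumptions made for the example.

```python
import math


def fit_match_mixture(vectors, n_iter=50, p_match=0.1):
    """EM for a two-class mixture over binary field-agreement vectors.

    vectors: list of equal-length 0/1 tuples, one per candidate record pair
             (1 = the two records agree on that field).
    Returns (p_match, m, u, posteriors) where
      m[j] = P(field j agrees | pair is a match)
      u[j] = P(field j agrees | pair is a non-match)
    Fields are assumed conditionally independent given the latent match label.
    """
    n = len(vectors)
    n_fields = len(vectors[0])
    eps = 1e-6                      # keeps probabilities away from 0 and 1
    m = [0.9] * n_fields            # assumed start: matches usually agree
    u = [0.1] * n_fields            # assumed start: non-matches rarely agree
    posteriors = [0.0] * n

    def clamp(x):
        return min(max(x, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        for i, v in enumerate(vectors):
            log_match = math.log(p_match)
            log_non = math.log(1.0 - p_match)
            for j, agrees in enumerate(v):
                log_match += math.log(m[j] if agrees else 1.0 - m[j])
                log_non += math.log(u[j] if agrees else 1.0 - u[j])
            diff = max(min(log_non - log_match, 700.0), -700.0)
            posteriors[i] = 1.0 / (1.0 + math.exp(diff))

        # M-step: re-estimate parameters from the soft labels.
        total = sum(posteriors)
        p_match = clamp(total / n)
        for j in range(n_fields):
            agree_match = sum(g * v[j] for g, v in zip(posteriors, vectors))
            agree_non = sum((1.0 - g) * v[j] for g, v in zip(posteriors, vectors))
            m[j] = clamp(agree_match / max(total, eps))
            u[j] = clamp(agree_non / max(n - total, eps))

    return p_match, m, u, posteriors


# Toy agreement vectors over three hypothetical fields (e.g. name, city, year).
pairs = (
    [(1, 1, 1)] * 20 + [(1, 1, 0)] * 5 + [(0, 0, 0)] * 80 + [(0, 1, 0)] * 15
)
p_match, m, u, posteriors = fit_match_mixture(pairs)
likely_matches = [v for v, g in zip(pairs, posteriors) if g > 0.5]
```

No labeled pairs are used: the match label is a latent variable, and the structure in the unlabeled agreement patterns is what separates the two classes. A model of this kind with many free parameters can overfit, which is the concern the overfitting-control techniques mentioned above are meant to address.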