Efficient Model Repository for Entity Resolution: Construction, Search, and Integration
This work addresses scalability and heterogeneity issues in multi-source entity resolution for data integration, though the contribution is incremental because it builds on existing model-reuse concepts.
The paper tackles the challenge of reusing classification models across multiple entity resolution tasks by proposing MoRER, which clusters similar tasks to build a model repository with moderate labeling effort. On three multi-source datasets, MoRER achieves results comparable to or better than those of label-limited methods such as active learning, and it outperforms self-supervised approaches.
Entity resolution (ER) is a fundamental task in data integration that enables insights from heterogeneous data sources. The primary challenge of ER lies in classifying record pairs as matches or non-matches, which in multi-source ER (MS-ER) scenarios is complicated by data source heterogeneity and scalability issues. Existing methods for MS-ER generally require labeled record pairs and fail to reuse models effectively across multiple ER tasks. We propose MoRER (Model Repositories for Entity Resolution), a novel method for building a model repository consisting of classification models that solve ER problems. By leveraging feature distribution analysis, MoRER clusters similar ER tasks, thereby enabling the effective initialization of a model repository with a moderate labeling effort. Experimental results on three multi-source datasets demonstrate that MoRER achieves results comparable to or better than those of methods that operate under limited labeling budgets, such as active learning and transfer learning approaches, while outperforming self-supervised approaches that utilize large pre-trained language models. When compared to supervised transformer-based methods, MoRER achieves comparable or better results, depending on the size of the training data set used.
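To make the idea of clustering ER tasks by feature distribution more concrete, the following is a minimal sketch and not the MoRER implementation. The abstract does not state which distribution comparison or clustering algorithm is used, so the sketch assumes a two-sample Kolmogorov-Smirnov statistic averaged over similarity features and a simple threshold-based grouping. The function names (ks_statistic, task_distance, cluster_tasks, make_task), the threshold of 0.3, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import random
from bisect import bisect_right
from itertools import combinations


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def task_distance(task_a, task_b):
    """Average KS statistic over features (assumed choice, not from the paper).

    Each task is a list of similarity feature vectors, one per record pair,
    and both tasks are assumed to use the same features in the same order.
    """
    cols_a = list(zip(*task_a))
    cols_b = list(zip(*task_b))
    stats = [ks_statistic(ca, cb) for ca, cb in zip(cols_a, cols_b)]
    return sum(stats) / len(stats)


def cluster_tasks(tasks, threshold=0.3):
    """Group tasks whose pairwise distance is below `threshold`.

    Uses union-find to form connected components; the threshold value is
    a hypothetical parameter for this sketch.
    """
    names = list(tasks)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if task_distance(tasks[a], tasks[b]) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


def make_task(rng, alpha, beta, n_pairs=200, n_features=2):
    """Synthetic ER task: similarity values in [0, 1] drawn from a Beta distribution."""
    return [[rng.betavariate(alpha, beta) for _ in range(n_features)]
            for _ in range(n_pairs)]


rng = random.Random(0)
tasks = {
    "task_a": make_task(rng, 5, 2),  # similarities skewed high
    "task_b": make_task(rng, 5, 2),  # same distribution as task_a
    "task_c": make_task(rng, 2, 5),  # similarities skewed low
}
clusters = cluster_tasks(tasks)
print(clusters)
```

In this sketch, tasks that land in the same cluster would share one classification model in the repository, so labels would only need to be collected once per cluster rather than once per task. How MoRER actually selects training data and assigns new tasks to existing models is not described in the abstract and is left out here.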