DB CL Apr 24, 2023

Pre-trained Embeddings for Entity Resolution: An Experimental Analysis [Experiment, Analysis & Benchmark]

arXiv:2304.12329v164 citations, h-index: 60
Originality: Synthesis-oriented
AI Analysis

This work addresses the lack of a detailed analysis of pre-trained embeddings for Entity Resolution and offers practical guidance to researchers and practitioners on selecting models. The contribution is incremental, since it builds on existing methods.

The paper conducts a thorough experimental analysis of 12 popular language models for Entity Resolution across 17 benchmark datasets, evaluating their vectorization overhead, blocking performance, and matching effectiveness to characterize their strengths and weaknesses.

Many recent works on Entity Resolution (ER) leverage Deep Learning techniques involving language models to improve effectiveness. This applies to both main steps of ER, i.e., blocking and matching. Several pre-trained embeddings have been tested, with the most popular ones being fastText and variants of the BERT model. However, there is no detailed analysis of their pros and cons. To cover this gap, we perform a thorough experimental analysis of 12 popular language models over 17 established benchmark datasets. First, we assess their vectorization overhead for converting all input entities into dense embedding vectors. Second, we investigate their blocking performance, performing a detailed scalability analysis and comparing them with the state-of-the-art deep learning-based blocking method. Third, we conclude with their relative performance for both supervised and unsupervised matching. Our experimental results provide novel insights into the strengths and weaknesses of the main language models, helping researchers and practitioners select the most suitable ones in practice.
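
The three stages named in the abstract (vectorization, blocking, matching) can be illustrated with a minimal sketch. This is not the paper's code or experimental setup: `embed` is a toy hashed character-trigram vectorizer standing in for a pre-trained language model, blocking is assumed to be top-k cosine-similarity search, and unsupervised matching is assumed to be a plain similarity threshold. The names `DIM`, `embed`, `block`, `match`, and the parameters `k` and `threshold` are all hypothetical.

```python
import math
import zlib

DIM = 64  # hypothetical embedding size for the toy vectorizer


def embed(text, dim=DIM):
    """Toy stand-in for a pre-trained model: hashed character trigrams, L2-normalized."""
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        # crc32 is used instead of hash() so the buckets are stable across runs.
        vec[zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(u, v):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(a * b for a, b in zip(u, v))


def block(left, right, k=2):
    """Blocking (assumed top-k search): keep the k most similar right entities per left entity."""
    right_vecs = [embed(r) for r in right]
    candidates = []
    for i, entity in enumerate(left):
        lv = embed(entity)
        scored = sorted(
            ((cosine(lv, rv), j) for j, rv in enumerate(right_vecs)),
            reverse=True,
        )
        candidates.extend((i, j, score) for score, j in scored[:k])
    return candidates


def match(candidates, threshold=0.5):
    """Unsupervised matching (assumed): accept pairs whose similarity clears the threshold."""
    return [(i, j) for i, j, score in candidates if score >= threshold]


if __name__ == "__main__":
    left = ["Apple iPhone 13 128GB", "Samsung Galaxy S21"]
    right = ["iphone 13 apple 128 gb", "Galaxy S21 by Samsung", "Dell XPS 13 laptop"]
    pairs = block(left, right, k=2)
    for i, j in match(pairs):
        print(left[i], "<->", right[j])
```

In practice `embed` would be replaced by a call to an actual pre-trained model, and the exhaustive similarity scan by an approximate nearest-neighbor index; the sketch only shows how the three stages feed into each other.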

Code Implementations: 1 repo
