DB, Apr 24, 2023
Pre-trained Embeddings for Entity Resolution: An Experimental Analysis [Experiment, Analysis & Benchmark]
Alexandros Zeakis, George Papadakis, Dimitrios Skoutas et al.
Many recent works on Entity Resolution (ER) leverage Deep Learning techniques involving language models to improve effectiveness. This applies to both main steps of ER, i.e., blocking and matching. Several pre-trained embeddings have been tested, with the most popular ones being fastText and variants of the BERT model. However, there is no detailed analysis of their pros and cons. To cover this gap, we perform a thorough experimental analysis of 12 popular language models over 17 established benchmark datasets. First, we assess their vectorization overhead for converting all input entities into dense embedding vectors. Second, we investigate their blocking performance, performing a detailed scalability analysis and comparing them with the state-of-the-art deep learning-based blocking method. Third, we conclude with their relative performance for both supervised and unsupervised matching. Our experimental results provide novel insights into the strengths and weaknesses of the main language models, helping researchers and practitioners select the most suitable ones in practice.
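As a rough illustration of embedding-based blocking, the sketch below pairs each entity of one collection with its k most similar entities of another, using cosine similarity over dense vectors. The abstract does not state how blocking is carried out, so the nearest-neighbour strategy, the function names (`cosine`, `knn_blocking`) and the hand-made two-dimensional vectors are all assumptions made for illustration; in practice the vectors would come from a pre-trained language model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_blocking(left, right, k=1):
    """Return candidate pairs (left_id, right_id).

    left, right: dicts mapping an entity id to its embedding vector.
    Each left entity is paired with its k most similar right entities.
    This brute-force search is a hypothetical stand-in for an
    approximate nearest-neighbour index.
    """
    candidates = set()
    for left_id, left_vec in left.items():
        ranked = sorted(
            right,
            key=lambda right_id: cosine(left_vec, right[right_id]),
            reverse=True,
        )
        for right_id in ranked[:k]:
            candidates.add((left_id, right_id))
    return candidates


# Toy vectors standing in for real embeddings (assumption, not real data).
left = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}
right = {"b1": [0.9, 0.1], "b2": [0.1, 0.9]}
pairs = knn_blocking(left, right, k=1)
```

Only the candidate pairs returned here would be passed on to the matching step, which is why blocking quality and scalability are evaluated separately from matching.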
DB, Jul 3, 2023
A Critical Re-evaluation of Benchmark Datasets for (Deep) Learning-Based Matching Algorithms
George Papadakis, Nishadi Kirielle, Peter Christen et al.
Entity resolution (ER) is the process of identifying records that refer to the same entities within one or across multiple databases. Numerous techniques have been developed to tackle ER challenges over the years, with recent emphasis placed on machine and deep learning methods for the matching phase. However, the quality of the benchmark datasets typically used in the experimental evaluations of learning-based matching algorithms has not been examined in the literature. To cover this gap, we propose four different approaches to assessing the difficulty and appropriateness of 13 established datasets: two theoretical approaches, which involve new measures of linearity and existing measures of complexity, and two practical approaches, namely the difference between the best non-linear and linear matchers, and the difference between the best learning-based matcher and the perfect oracle. Our analysis demonstrates that most of the popular datasets pose rather easy classification tasks. As a result, they are not suitable for properly evaluating learning-based matching algorithms. To address this issue, we propose a new methodology for creating benchmark datasets. We put it into practice by creating four new matching tasks, and we verify that these new benchmarks are more challenging and therefore more suitable for further advancements in the field.
NA, Oct 24, 2018
A Preconditioned Multiple Shooting Shadowing Algorithm for the Sensitivity Analysis of Chaotic Systems
Karim Shawki, George Papadakis
We propose a preconditioner that can accelerate the rate of convergence of the Multiple Shooting Shadowing (MSS) method. This recently proposed method can be used to compute derivatives of time-averaged objectives (also known as sensitivities) with respect to system parameter(s) for chaotic systems. We propose a block diagonal preconditioner, which is based on a partial singular value decomposition of the MSS constraint matrix. The preconditioner can be computed using matrix-vector products only (i.e. it is matrix-free) and is fully parallelised in the time domain. Two chaotic systems are considered, the Lorenz system and the 1D Kuramoto-Sivashinsky equation. Combining the preconditioner with a regularisation method brackets the eigenvalues tightly within a narrow range. This combination results in a significant reduction in the number of iterations, and renders the convergence rate almost independent of the number of degrees of freedom of the system and of the length of the trajectory that is used to compute the time-averaged objective. This can potentially allow the method to be used for large chaotic systems (such as turbulent flows) and optimal control applications. The singular value decomposition of the constraint matrix can also be used to quantify the effect of the system condition on the accuracy of the sensitivities. In fact, neglecting the singular modes affected by noise, we recover the correct values of sensitivity that match closely with those obtained with finite differences for the Kuramoto-Sivashinsky equation in the light turbulent regime.
IR, Jan 16, 2019
Comparative Analysis of Content-based Personalized Microblog Recommendations [Experiments and Analysis]
Efi Karra Taniskidou, George Papadakis, George Giannakopoulos et al.
Microblogging platforms constitute a popular means of real-time communication and information sharing. They involve such a large volume of user-generated content that their users suffer from an information deluge. To address it, numerous recommendation methods have been proposed to organize the posts a user receives according to her interests. The content-based methods typically build a text-based model for every individual user to capture her tastes and then rank the posts in her timeline according to their similarity with that model. Even though content-based methods have attracted considerable interest in the data management community, there is no comprehensive evaluation of the main factors that affect their performance. These are: (i) the representation model that converts an unstructured text into a structured representation that elucidates its characteristics, (ii) the source of the microblog posts that compose the user models, and (iii) the type of the user's posting activity. To cover this gap, we systematically examine the performance of 9 state-of-the-art representation models in combination with 13 representation sources and 3 user types over a large, real dataset from Twitter comprising 60 users. We also consider a wide range of 223 plausible configurations for the representation models in order to assess their robustness with respect to their internal parameters. To facilitate the interpretation of our experimental results, we introduce a novel taxonomy of representation models. Our analysis provides novel insights into the performance and functionality of the main factors determining the performance of content-based recommendation in microblogs.
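The content-based pipeline described above (build a text model per user, then rank timeline posts by similarity to it) can be sketched as follows. This is a minimal illustration only: a plain bag-of-words model with cosine similarity is assumed as the representation model, which is not necessarily among the 9 models the paper evaluates, and the function names and sample posts are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term-frequency vector of a post (lowercased, whitespace-tokenised)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_user_model(source_posts):
    """Aggregate the posts chosen as the representation source into one model."""
    model = Counter()
    for post in source_posts:
        model.update(bag_of_words(post))
    return model


def rank_timeline(user_model, timeline):
    """Order timeline posts by decreasing similarity to the user model."""
    return sorted(
        timeline,
        key=lambda post: cosine(user_model, bag_of_words(post)),
        reverse=True,
    )


# Hypothetical sample data, not taken from the paper's Twitter dataset.
user_model = build_user_model([
    "entity resolution with blocking",
    "benchmark datasets for entity matching",
])
timeline = [
    "football match tonight",
    "new entity resolution benchmark released",
    "the weather is nice",
]
ranked = rank_timeline(user_model, timeline)
```

Changing which posts are passed to `build_user_model` corresponds to varying the representation source, while swapping `bag_of_words` for another vectoriser corresponds to varying the representation model.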