AI · Mar 27, 2024

Leveraging Large Language Models for Fuzzy String Matching in Political Science

arXiv:2403.18218v1 · h-index: 7
Originality: Highly original
AI Analysis

This paper addresses a key data integration issue for political scientists, offering a more intuitive and effective solution than traditional string-distance methods.

The paper tackles the problem of fuzzy string matching in political science by using large language models to match entities with different names, achieving up to a 39% improvement in average precision over existing methods.

Fuzzy string matching remains a key issue when political scientists combine data from different sources. Existing matching methods invariably rely on string distances, such as Levenshtein distance and cosine similarity. As such, they are inherently incapable of matching strings that refer to the same entity with different names such as "JP Morgan" and "Chase Bank", "DPRK" and "North Korea", "Chuck Fleischmann (R)" and "Charles Fleischmann (R)". In this letter, we propose to use large language models to entirely sidestep this problem in an easy and intuitive manner. Extensive experiments show that our proposed methods can improve the state of the art by as much as 39% in terms of average precision while being substantially easier and more intuitive to use by political scientists. Moreover, our results are robust against various temperatures. We further note that enhanced prompting can lead to additional performance improvements.
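To make the limitation of string distances concrete, the following minimal Python sketch computes a normalized Levenshtein similarity for the name pairs quoted in the abstract. It also shows one way a yes/no matching question could be phrased for a language model. The function names and the prompt wording are assumptions made for illustration; they are not taken from the paper, and no model is actually called.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def build_match_prompt(a: str, b: str) -> str:
    """Hypothetical prompt wording; the paper's actual prompts may differ."""
    return (f'Do "{a}" and "{b}" refer to the same entity? '
            "Answer with yes or no.")


# Name pairs quoted in the abstract.
PAIRS = [
    ("JP Morgan", "Chase Bank"),
    ("DPRK", "North Korea"),
    ("Chuck Fleischmann (R)", "Charles Fleischmann (R)"),
]

if __name__ == "__main__":
    for left, right in PAIRS:
        score = normalized_similarity(left, right)
        print(f"{left!r} vs {right!r}: similarity = {score:.2f}")
        print("  prompt:", build_match_prompt(left, right))
```

Because "DPRK" and "North Korea" differ in length by seven characters, their edit distance is at least seven, so a character-level score cannot signal that the two names denote the same country. A model that knows both names can answer the question directly.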

