Duplicate Detection with GenAI
This work addresses data quality issues for businesses that use CRM systems, though the contribution is incremental: it builds on existing entity matching methods with new AI techniques.
The paper tackles the problem of duplicate records in CRM systems by applying Large Language Models and Generative AI, achieving an improvement in de-duplication accuracy from 30% to nearly 60% on benchmark datasets.
Customer data is often stored as records in Customer Relationship Management systems (CRMs). Data that is manually entered into such systems by one or more users over time leads to data replication, partial duplication or fuzzy duplication. This in turn means that there is no longer a single source of truth for customers, contacts, accounts, etc. Downstream business processes become increasingly complex and contrived without a unique mapping between a record in a CRM and the target customer. Current methods to detect and de-duplicate records use traditional Natural Language Processing techniques known as Entity Matching. In this paper we show how using the latest advancements in Large Language Models and Generative AI can vastly improve the identification and repair of duplicated records. On common benchmark datasets we find that de-duplication accuracy improves from 30 percent using NLP techniques to almost 60 percent using our proposed method.
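To make the notion of fuzzy duplication concrete, the following is a minimal sketch of a string-similarity baseline in the spirit of traditional entity matching. It is not the method proposed in the paper. The field names, the sample records, the helper names (`normalize`, `similarity`, `find_duplicates`) and the 0.85 threshold are all assumptions chosen for illustration; only Python's standard `difflib` module is used.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(record):
    """Flatten a record into one lowercase string with collapsed whitespace."""
    return " ".join(" ".join(str(value).lower().split()) for value in record.values())


def similarity(a, b):
    """Return a similarity score in [0, 1] between two records."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records, threshold=0.85):
    """Return index pairs whose similarity meets the (assumed) threshold.

    This compares every pair, so it is quadratic in the number of records;
    real systems usually add a blocking step to limit the comparisons.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if similarity(a, b) >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical CRM records: the first two describe the same contact
# with a spelling variation and different capitalisation.
records = [
    {"name": "Jon Smith", "company": "Acme Corp", "email": "jon.smith@acme.com"},
    {"name": "John Smith", "company": "ACME Corp", "email": "jon.smith@acme.com"},
    {"name": "Maria Garcia", "company": "Globex", "email": "m.garcia@globex.com"},
]

if __name__ == "__main__":
    print(find_duplicates(records))
```

A character-level baseline of this kind catches near-identical spellings but has no notion of meaning, so it can miss records that refer to the same customer through abbreviations, nicknames or reordered fields.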