Huahang Li

DB
h-index: 5
3 papers
9 citations
Novelty: 50%
AI Score: 38

3 Papers

DB · Aug 24, 2024
Prompt-Matcher: Leveraging Large Models to Reduce Uncertainty in Schema Matching Results

Longyu Feng, Huahang Li, Chen Jason Zhang

Schema matching is the process of identifying correspondences between the elements of two given schemata, and it is essential for database management systems, data integration, and data warehousing. The optimal schema matching algorithm differs across datasets and scenarios. Even for a single algorithm, hyperparameter tuning also causes multiple results. All results, assigned equal probabilities, are stored in probabilistic databases to facilitate uncertainty management. The substantial degree of uncertainty diminishes the efficiency and reliability of data processing, which prevents decision-makers from receiving more accurate information. To address this problem, we introduce a new approach based on fine-grained correspondence verification with specific prompts for a Large Language Model. Our approach is an iterative loop that consists of three main components: (1) the correspondence selection algorithm, (2) correspondence verification, and (3) the update of the probability distribution. The core idea is that correspondences intersect across multiple results, thereby linking the verification of correspondences to the reduction of uncertainty in candidate results. The task of selecting an optimal correspondence set to maximize the expected uncertainty reduction within a fixed budget is shown to be NP-hard. We propose a novel $(1-1/e)$-approximation algorithm that significantly outperforms the brute-force algorithm in terms of computational efficiency. To enhance correspondence verification, we have developed two prompt templates that enable GPT-4 to achieve state-of-the-art performance on two established benchmark datasets. Our comprehensive experimental evaluation demonstrates the superior effectiveness and robustness of the proposed approach.
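The loop described in this abstract (select a correspondence, verify it, update the distribution) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes Shannon entropy as the uncertainty measure, a verifier that is always right, and a one-step greedy choice instead of the budgeted $(1-1/e)$-approximation. The names `entropy`, `update`, `expected_entropy_after`, and `pick_next` are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a distribution over candidate results."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def update(results, probs, corr, is_correct):
    """Condition the distribution on one verification answer.

    Assumption: the verifier is always right, so every candidate result
    that disagrees with the answer gets probability zero.
    """
    weights = [p if ((corr in r) == is_correct) else 0.0
               for r, p in zip(results, probs)]
    total = sum(weights)
    if total == 0:  # answer contradicts every candidate; keep the prior
        return list(probs)
    return [w / total for w in weights]

def expected_entropy_after(results, probs, corr):
    """Expected remaining entropy if `corr` were verified next."""
    p_yes = sum(p for r, p in zip(results, probs) if corr in r)
    expected = 0.0
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            expected += p_answer * entropy(update(results, probs, corr, answer))
    return expected

def pick_next(results, probs):
    """One-step greedy choice: the correspondence that leaves the least
    expected entropy (ties broken alphabetically)."""
    candidates = sorted(set().union(*results))
    return min(candidates,
               key=lambda c: expected_entropy_after(results, probs, c))

# Four candidate matchings, each a set of correspondence labels,
# stored with equal probabilities as in the abstract.
results = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "d"}]
probs = [0.25, 0.25, 0.25, 0.25]

chosen = pick_next(results, probs)
posterior = update(results, probs, chosen, True)
```

In this toy input, a correspondence that appears in exactly half of the candidate results is the most informative one to verify, which reflects the abstract's point that correspondences shared across results link verification to uncertainty reduction.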

DB · Apr 27
DataClaw: An Autonomous Data Agent with Instant Messaging Integration

Huahang Li, Wentao Hu, Zhuoyue Wan et al.

In daily life, there are many scenarios in which people need to tackle data-related tasks, such as filling out forms, analyzing Excel files, and visualizing data reports. However, the tools available for these tasks are often fragmented, requiring users to switch between multiple applications and manually orchestrate steps like data processing, querying, and visualization. Moreover, these tools often assume a certain level of technical proficiency, creating barriers for non-technical users. To facilitate tackling daily data tasks, we present DataClaw, an autonomous data agent that integrates directly into familiar instant messaging (IM) platforms. By simply typing a natural language request in a chat interface, users enable DataClaw to autonomously plan and execute a complete analytical pipeline, delivering insights, charts, and reports directly back into the conversation. Under the hood, DataClaw is powered by a transparent ReAct reasoning engine, a multi-tiered memory system for cross-session context preservation, and a pluggable skill architecture for on-the-fly extensibility. In this demonstration, attendees will interact with DataClaw via standard IM platforms to solve real-world data scenarios, experiencing how it serves as a highly capable personal data assistant.

CL · Jan 7, 2024
On Leveraging Large Language Models for Enhancing Entity Resolution: A Cost-efficient Approach

Huahang Li, Longyu Feng, Shuangyin Li et al.

Entity resolution, the task of identifying and merging records that refer to the same real-world entity, is crucial in sectors like e-commerce, healthcare, and law enforcement. Large Language Models (LLMs) introduce an innovative approach to this task, capitalizing on their advanced linguistic capabilities and a "pay-as-you-go" model that provides significant advantages to those without extensive data science expertise. However, current LLMs are costly due to per-request API billing. Existing methods often either lack quality or become prohibitively expensive at scale. To address these problems, we propose an uncertainty reduction framework that uses LLMs to improve entity resolution results. We first initialize the possible partitions of the records into entity clusters, where the records in each cluster refer to the same entity, and define the uncertainty of the result. Then, we reduce the uncertainty by selecting a few valuable matching questions for LLM verification. Upon receiving the answers, we update the probability distribution of the possible partitions. To further reduce costs, we design an efficient algorithm to judiciously select the most valuable matching pairs to query. Additionally, we create error-tolerant techniques to handle LLM mistakes and a dynamic adjustment method to reach truly correct partitions. Experimental results show that our method is efficient and effective, offering promising applications in real-world tasks.
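The update step in this abstract, combined with tolerance for LLM mistakes, can be sketched as a Bayesian reweighting of the candidate partitions. This is only an illustration under stated assumptions, not the paper's technique: the answer accuracy of 0.9 is an invented value, and the names `same_cluster` and `noisy_update` are hypothetical.

```python
def same_cluster(partition, a, b):
    """True if records a and b fall in the same block of the partition."""
    return any(a in block and b in block for block in partition)

def noisy_update(partitions, probs, pair, answer, accuracy=0.9):
    """Bayesian update of the partition distribution after one LLM answer.

    Assumption: the LLM answers a match question correctly with
    probability `accuracy` (0 < accuracy < 1), so partitions that
    disagree with the answer are down-weighted rather than eliminated.
    """
    a, b = pair
    weights = []
    for partition, p in zip(partitions, probs):
        agrees = same_cluster(partition, a, b) == answer
        weights.append(p * (accuracy if agrees else 1.0 - accuracy))
    total = sum(weights)
    return [w / total for w in weights]

# Three candidate partitions of records r1..r3, initially equally likely.
partitions = [
    [{"r1", "r2"}, {"r3"}],
    [{"r1"}, {"r2", "r3"}],
    [{"r1", "r2", "r3"}],
]
prior = [1 / 3, 1 / 3, 1 / 3]

# The LLM says r1 and r2 refer to the same entity.
posterior = noisy_update(partitions, prior, ("r1", "r2"), True)
```

Because a disagreeing partition keeps a small probability, a later answer can still restore it if the first answer turns out to be an LLM mistake.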