Interactive Data Harmonization with LLM Agents: Opportunities and Challenges
The work addresses the challenge of integrating diverse datasets for domain experts, but the contribution appears incremental, since it largely adds LLM integration to existing harmonization methods.
The paper tackles data harmonization, which remains time-consuming because of schema mismatches and terminological differences. It introduces Harmonia, a system that uses LLM agents to automate the synthesis of harmonization pipelines, and demonstrates it in a clinical scenario where it interactively creates reusable pipelines that map datasets to a standard format.
Data harmonization is an essential task that entails integrating datasets from diverse sources. Despite years of research in this area, it remains a time-consuming and challenging task due to schema mismatches, varying terminologies, and differences in data collection methodologies. This paper presents the case for agentic data harmonization as a means to both empower experts to harmonize their data and to streamline the process. We introduce Harmonia, a system that combines LLM-based reasoning, an interactive user interface, and a library of data harmonization primitives to automate the synthesis of data harmonization pipelines. We demonstrate Harmonia in a clinical data harmonization scenario, where it helps to interactively create reusable pipelines that map datasets to a standard format. Finally, we discuss challenges and open problems, and suggest research directions for advancing our vision.
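To make the idea of a reusable pipeline built from harmonization primitives more concrete, here is a minimal sketch in Python. It is not Harmonia's actual API: the names `rename_column`, `map_values`, and `build_pipeline`, as well as the sample column names and value mappings, are hypothetical and chosen only for illustration. In the system described above, an LLM agent would propose such steps and the user would review them interactively; in this sketch the steps are written by hand.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Primitive = Callable[[Record], Record]


def rename_column(old: str, new: str) -> Primitive:
    """Hypothetical primitive: rename a source column to a target schema name."""
    def apply(record: Record) -> Record:
        out = dict(record)  # copy so the source record is left untouched
        if old in out:
            out[new] = out.pop(old)
        return out
    return apply


def map_values(column: str, mapping: Dict[object, object]) -> Primitive:
    """Hypothetical primitive: translate source terms into the target vocabulary."""
    def apply(record: Record) -> Record:
        out = dict(record)
        if column in out:
            # Values without a mapping are passed through unchanged.
            out[column] = mapping.get(out[column], out[column])
        return out
    return apply


def build_pipeline(steps: List[Primitive]) -> Callable[[List[Record]], List[Record]]:
    """Compose primitives into a pipeline that can be reused on other datasets."""
    def run(records: List[Record]) -> List[Record]:
        result = []
        for record in records:
            for step in steps:
                record = step(record)
            result.append(record)
        return result
    return run


# Assumed example: map a source table to a target format with a "gender" column.
pipeline = build_pipeline([
    rename_column("Sex", "gender"),
    map_values("gender", {"M": "male", "F": "female"}),
])

source = [{"Sex": "M", "age_yrs": 61}, {"Sex": "F", "age_yrs": 47}]
harmonized = pipeline(source)
print(harmonized)
```

Because the pipeline is an ordinary function built from small steps, the same `pipeline` object can be applied to any other dataset that shares the source schema, which is the sense of "reusable" assumed in this sketch.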