DB AI | Mar 10

Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

Josef Hardi, Martin J. O'Connor, Marcos Martinez-Romero, Jean G. Rosario, Stephen A. Fisher, Mark A. Musen

arXiv:2604.08552 | 49.4 | h-index: 8

AI Analysis

This work addresses the challenge of making biomedical datasets FAIR by automating metadata standardization, though it is an incremental advance over prior LLM-guided approaches.

The paper tackles the problem of incomplete and noncompliant biomedical metadata with an LLM-based system that queries authoritative terminology services in real time, improving prediction accuracy over the LLM alone on 839 legacy metadata records from HuBMAP.

Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. When reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries authoritative biomedical terminology services in real time to retrieve canonically correct vocabulary terms on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert-curated gold standard for exact-match assessment. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields, demonstrating a practical, scalable approach to automated standardization of biomedical metadata.
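The exact-match assessment described in the abstract can be illustrated with a minimal sketch. It assumes predictions and gold-standard values are stored as dictionaries keyed by (record ID, field name), and that a field counts as correct only when the predicted string is identical to the curated one. The function name, field names, and toy values below are hypothetical and are not taken from the paper; the authors' actual scoring procedure (for example, any normalization before comparison) is not specified here.

```python
from collections import defaultdict


def exact_match_accuracy(predictions, gold, ontology_fields):
    """Return exact-match accuracy for each field group.

    predictions, gold: dicts mapping (record_id, field_name) -> value string.
    ontology_fields: set of field names treated as ontology-constrained.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for (record_id, field), expected in gold.items():
        group = "ontology" if field in ontology_fields else "free"
        totals[group] += 1
        # Strict comparison: no case folding or whitespace normalization.
        if predictions.get((record_id, field)) == expected:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical toy data, invented for illustration only.
gold = {
    ("r1", "organ"): "kidney",
    ("r2", "organ"): "heart",
    ("r1", "preparation"): "fresh frozen",
    ("r2", "preparation"): "fixed",
}
llm_only = {
    ("r1", "organ"): "Kidney",  # wrong under exact match (case differs)
    ("r2", "organ"): "heart",
    ("r1", "preparation"): "fresh frozen",
    ("r2", "preparation"): "FFPE",
}
llm_with_tools = {
    ("r1", "organ"): "kidney",
    ("r2", "organ"): "heart",
    ("r1", "preparation"): "fresh frozen",
    ("r2", "preparation"): "fixed",
}

ontology_fields = {"organ"}
baseline_scores = exact_match_accuracy(llm_only, gold, ontology_fields)
tool_scores = exact_match_accuracy(llm_with_tools, gold, ontology_fields)
print(baseline_scores)
print(tool_scores)
```

Splitting the score by field group mirrors the abstract's comparison across ontology-constrained and non-ontology-constrained fields. Exact match is deliberately strict: a prediction that differs only in capitalization is counted as an error, which is why retrieving the canonical term from a terminology service can matter for this metric.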

