IR · Mar 25

VILLA: Versatile Information Retrieval From Scientific Literature Using Large LAnguage Models

arXiv:2603.23849 · 67.0 · h-index: 31
Predicted impact: top 49% in IR · last 90 days
Originality: Incremental advance
AI Analysis

This paper addresses the problem of automating scientific information extraction for researchers in virology. The advance is incremental, since it adapts RAG to a new domain.

The study tackles the lack of high-quality datasets for training ML models in science. It develops VILLA, a multi-step RAG framework for extracting mutations from virology literature, and reports that VILLA outperforms vanilla RAG and other RAG- and agent-based tools on a novel dataset of 629 mutations from 239 publications.

The lack of high-quality ground truth datasets to train machine learning (ML) models impedes the potential of artificial intelligence (AI) for science research. Scientific information extraction (SIE) from the literature using LLMs is emerging as a powerful approach to automate the creation of these datasets. However, existing LLM-based approaches and benchmarking studies for SIE focus on broad topics such as biomedicine and chemistry, are limited to choice-based tasks, and focus on extracting information from short and well-formatted text. The potential of SIE methods in complex, open-ended tasks is considerably under-explored. In this study, we use a domain that has been virtually ignored in SIE, namely virology, to address these research gaps. We design a unique, open-ended SIE task of extracting mutations in a given virus that modify its interaction with the host. We develop a new, multi-step retrieval-augmented generation (RAG) framework called VILLA for SIE. In parallel, we curate a novel dataset of 629 mutations in ten influenza A virus proteins obtained from 239 scientific publications to serve as ground truth for the mutation extraction task. Finally, we demonstrate VILLA's superior performance using a novel and comprehensive evaluation and comparison with vanilla RAG and other state-of-the-art RAG- and agent-based tools for SIE.
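
The abstract does not say how extracted mutations are compared with the curated ground truth. As a rough illustration only, one common way to score an open-ended extraction task is set-based precision, recall, and F1 over normalized mutation strings. Everything in the sketch below is an assumption and is not taken from the paper: the function names, the regular expression for point mutations written as "E627K", and the example strings.

```python
import re

# Hypothetical pattern for point mutations such as "E627K":
# one-letter wild-type residue, position, one-letter mutant residue.
AMINO = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"\b([{AMINO}])(\d+)([{AMINO}])\b")


def extract_mutations(text):
    """Return the set of normalized mutation strings found in text."""
    return {f"{wt}{pos}{mt}" for wt, pos, mt in MUTATION_RE.findall(text)}


def score(predicted, truth):
    """Set-based precision, recall, and F1 against a ground-truth set."""
    tp = len(predicted & truth)  # mutations present in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Illustrative inputs, not data from the paper.
answer = "The substitutions E627K and D701N were reported in PB2."
predicted = extract_mutations(answer)
truth = {"E627K", "D701N", "K356R"}
precision, recall, f1 = score(predicted, truth)
print(sorted(predicted), precision, recall, f1)
```

Exact string matching is the simplest choice here. A real evaluation would likely also need to match each mutation to its protein and handle alternative notations, which this sketch ignores.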
