CL · Apr 10

ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

Shahar Levy, Eliya Habba, Reshef Mintz, Barak Raveh, Renana Keydar, Gabriel Stanovsky

arXiv:2604.09237 · 66.3 · Has Code

AI Analysis

This work addresses the slow, error-prone process of manual data extraction that domain experts face in fields such as law and biology, offering a tool that automates schema discovery while letting experts steer it.

The paper tackles the problem of extracting structured data from large document collections to answer natural-language research questions, a task that traditionally requires manual schema design and labeling. It introduces ScheMatiQ, which uses an LLM to generate a schema and a database automatically, and the authors show that its outputs support real-world analysis in law and computational biology.

Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which leverages calls to a backbone LLM to take a question and a corpus and produce a schema and a grounded database, with a web interface that lets users steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com
