CL IR Jun 7, 2020

Interactive Extractive Search over Biomedical Corpora

Hillel Taub-Tabib, Micah Shlain, Shoval Sadde, Dan Lahav, Matan Eyal, Yaara Cohen, Yoav Goldberg

arXiv:2006.04148v1 31.21004 citations

Originality: Incremental advance

AI Analysis

This system addresses the need for accessible and efficient search tools for biomedical researchers, though it is incremental as it builds on existing dependency-based search methods.

The paper aims to let life-science researchers search biomedical corpora using patterns over dependency graphs, among other pattern types, without knowing the underlying linguistic details. It introduces a lightweight query language based on example sentences with markup, and demonstrates it at interactive speed on large datasets such as PubMed and CORD-19.

We present a system that allows life-science researchers to search a linguistically annotated corpus of scientific texts using patterns over dependency graphs, as well as using patterns over token sequences and a powerful variant of boolean keyword queries. In contrast to previous attempts at dependency-based search, we introduce a lightweight query language that does not require the user to know the details of the underlying linguistic representations, and instead lets the user query the corpus by providing an example sentence coupled with simple markup. Search is performed at interactive speed thanks to an efficient linguistic graph-indexing and retrieval engine. This allows for rapid exploration, development, and refinement of user queries. We demonstrate the system using example workflows over two corpora: the PubMed corpus, comprising 14,446,243 PubMed abstracts, and the CORD-19 dataset, a collection of over 45,000 research papers focused on COVID-19 research. The system is publicly available at https://allenai.github.io/spike
