PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models
This work aims to make protein research more flexible and efficient for biologists through a zero-shot approach, though the contribution is incremental in that it builds on existing large language models.
The authors tackle the lack of high-quality datasets and robust evaluation for zero-shot protein question answering by introducing the Pika framework, which comprises a curated dataset and a biochemically relevant benchmarking strategy; multimodal large language models are proposed as a baseline.
Understanding protein structure and function is crucial in biology. However, current computational methods are often task-specific and resource-intensive. To address this, we propose zero-shot Protein Question Answering (PQA), a task designed to answer a wide range of protein-related queries without task-specific training. The success of PQA hinges on high-quality datasets and robust evaluation strategies, both of which are lacking in current research. Existing datasets suffer from biases, noise, and a lack of evolutionary context, while current evaluation methods fail to accurately assess model performance. We introduce the Pika framework to overcome these limitations. Pika comprises a curated, debiased dataset tailored for PQA and a biochemically relevant benchmarking strategy. We also propose multimodal large language models as a strong baseline for PQA, leveraging their natural language processing capabilities and prior knowledge. This approach promises a more flexible and efficient way to explore protein properties, advancing protein research. Our comprehensive PQA framework, Pika, including the dataset, code, and model checkpoints, is openly accessible at github.com/EMCarrami/Pika, promoting wider research in the field.