CL · AI · May 26, 2025

Rethinking Text-based Protein Understanding: Retrieval or LLM?

Juntong Wu, Zijing Liu, He Cao, Hao Li, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, Yu Li

arXiv:2505.20354v413.07 citations · h-index: 6 · Has Code · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses evaluation challenges in protein understanding for bioinformatics, though it is incremental as it builds on existing datasets and methods.

The authors tackled the problem of evaluating protein-text models by identifying data leakage and inadequate metrics in existing benchmarks, and introduced a retrieval-enhanced method that outperforms fine-tuned LLMs in protein-to-text generation with improved accuracy and efficiency.

In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess model performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by these observations, we propose a retrieval-enhanced method that significantly outperforms fine-tuned LLMs for protein-to-text generation and is both accurate and efficient in training-free scenarios. Our code and data are available at https://github.com/IDEA-XL/RAPM.
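To make the two ideas in the abstract concrete, here is a minimal sketch of a training-free retrieval baseline and an entity-level score. It is not the authors' RAPM implementation: the k-mer overlap similarity, the function names (`similarity`, `retrieve_annotation`, `entity_f1`), and the toy sequences are all assumptions chosen for illustration. A real system would likely use a sequence aligner or learned embeddings for retrieval and a curated entity extractor for scoring.

```python
from collections import Counter


def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in a protein sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def similarity(a: str, b: str, k: int = 3) -> float:
    """Overlap of k-mer multisets (hypothetical stand-in for a real aligner)."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    shared = sum((pa & pb).values())  # multiset intersection
    total = sum((pa | pb).values())   # multiset union
    return shared / total if total else 0.0


def retrieve_annotation(query: str, database: list[tuple[str, str]]) -> str:
    """Return the text annotation of the most similar database sequence."""
    _, best_text = max(database, key=lambda item: similarity(query, item[0]))
    return best_text


def entity_f1(predicted: set[str], reference: set[str]) -> float:
    """F1 over sets of biological entities rather than over n-grams."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy data: made-up sequences and labels, for illustration only.
database = [
    ("MKTAYIAKQR", "ATP binding"),
    ("GSHMLEDPVQ", "membrane transport"),
]
print(retrieve_annotation("MKTAYIAKQK", database))
print(entity_f1({"ATP binding", "kinase"}, {"ATP binding", "membrane"}))
```

Scoring on entity sets means a prediction is rewarded for naming the right functions or terms, not for sharing surface wording with the reference, which is the weakness of n-gram metrics that the abstract points to.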

