AIJan 7
Interleaved Tool-Call Reasoning for Protein Function UnderstandingChuanliu Fan, Zicheng Ma, Huanran Meng et al.
Recent advances in large language models (LLMs) have highlighted the effectiveness of chain-of-thought reasoning in symbolic domains such as mathematics and programming. However, our study shows that directly transferring such text-based reasoning paradigms to protein function understanding is ineffective: reinforcement learning mainly amplifies superficial keyword patterns while failing to introduce new biological knowledge, resulting in limited generalization. We argue that protein function prediction is a knowledge-intensive scientific task that fundamentally relies on external biological priors and computational tools rather than purely internal reasoning. To address this gap, we propose PFUA, a tool-augmented protein reasoning agent that unifies problem decomposition, tool invocation, and grounded answer generation. Instead of relying on long unconstrained reasoning traces, PFUA integrates domain-specific tools to produce verifiable intermediate evidence. Experiments on four benchmarks demonstrate that PFUA consistently outperforms text-only reasoning models with an average performance improvement of 103%.
LGMar 31, 2025
ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-TuningHuandong Chang, Zicheng Ma, Mingyuan Ma et al.
Low-Rank Adaptation (LoRA) has become a widely adopted technique for fine-tuning large-scale pre-trained models with minimal parameter updates. However, existing methods rely on fixed ranks or focus solely on either rank pruning or expansion, failing to adapt ranks dynamically to match the importance of different layers during training. In this work, we propose ElaLoRA, an adaptive low-rank adaptation framework that dynamically prunes and expands ranks based on gradient-derived importance scores. To the best of our knowledge, ElaLoRA is the first method that enables both rank pruning and expansion during fine-tuning. Experiments across multiple benchmarks demonstrate that ElaLoRA consistently outperforms existing PEFT methods across different parameter budgets. Furthermore, our studies validate that layers receiving higher rank allocations contribute more significantly to model performance, providing theoretical justification for our adaptive strategy. By introducing a principled and adaptive rank allocation mechanism, ElaLoRA offers a scalable and efficient fine-tuning solution, particularly suited for resource-constrained environments.
CEFeb 27, 2025
ChatMol: A Versatile Molecule Designer Based on the Numerically Enhanced Large Language ModelChuanliu Fan, Ziqiang Cao, Zicheng Ma et al.
Goal-oriented de novo molecule design, namely generating molecules with specific property or substructure constraints, is a crucial yet challenging task in drug discovery. Existing methods, such as Bayesian optimization and reinforcement learning, often require training multiple property predictors and struggle to incorporate substructure constraints. Inspired by the success of Large Language Models (LLMs) in text generation, we propose ChatMol, a novel approach that leverages LLMs for molecule design across diverse constraint settings. Initially, we crafted a molecule representation compatible with LLMs and validated its efficacy across multiple online LLMs. Afterwards, we developed specific prompts geared towards diverse constrained molecule generation tasks to further fine-tune current LLMs while integrating feedback learning derived from property prediction. Finally, to address the limitations of LLMs in numerical recognition, we referred to the position encoding method and incorporated additional encoding for numerical values within the prompt. Experimental results across single-property, substructure-property, and multi-property constrained tasks demonstrate that ChatMol consistently outperforms state-of-the-art baselines, including VAE and RL-based methods. Notably, in multi-objective binding affinity maximization task, ChatMol achieves a significantly lower KD value of 0.25 for the protein target ESR1, while maintaining the highest overall performance, surpassing previous methods by 4.76%. Meanwhile, with numerical enhancement, the Pearson correlation coefficient between the instructed property values and those of the generated molecules increased by up to 0.49. These findings highlight the potential of LLMs as a versatile framework for molecule generation, offering a promising alternative to traditional latent space and RL-based approaches.
BMMar 11, 2025
ProtTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language ModelsZicheng Ma, Chuanliu Fan, Zhicong Wang et al.
Large language models have made remarkable progress in the field of molecular science, particularly in understanding and generating functional small molecules. This success is largely attributed to the effectiveness of molecular tokenization strategies. In protein science, the amino acid sequence serves as the sole tokenizer for LLMs. However, many fundamental challenges in protein science are inherently structure-dependent. The absence of structure-aware tokens significantly limits the capabilities of LLMs for comprehensive biomolecular comprehension and multimodal generation. To address these challenges, we introduce a novel framework, ProtTeX, which tokenizes the protein sequences, structures, and textual information into a unified discrete space. This innovative approach enables joint training of the LLM exclusively through the Next-Token Prediction paradigm, facilitating multimodal protein reasoning and generation. ProtTeX enables general LLMs to perceive and process protein structures through sequential text input, leverage structural information as intermediate reasoning components, and generate or manipulate structures via sequential text output. Experiments demonstrate that our model achieves significant improvements in protein function prediction, outperforming the state-of-the-art domain expert model with a twofold increase in accuracy. Our framework enables high-quality conformational generation and customizable protein design. For the first time, we demonstrate that by adopting the standard training and inference pipelines from the LLM domain, ProtTeX empowers decoder-only LLMs to effectively address diverse spectrum of protein-related tasks.
LGFeb 7, 2025
Prot2Chat: Protein LLM with Early-Fusion of Text, Sequence and StructureZhicong Wang, Zicheng Ma, Ziqiang Cao et al.
Motivation: Proteins are of great significance in living organisms. However, understanding their functions encounters numerous challenges, such as insufficient integration of multimodal information, a large number of training parameters, limited flexibility of classification-based methods, and the lack of systematic evaluation metrics for protein Q&A systems. To tackle these issues, we propose the Prot2Chat framework. Results: We modified ProteinMPNN to encode protein sequence and structural information in a unified way. We used a large language model (LLM) to encode questions into vectors and developed a protein-text adapter to compress protein information into virtual tokens based on these vectors, achieving the early fusion of text and protein information. Finally, the same LLM reads the virtual tokens and the questions to generate answers. To optimize training efficiency, we froze the encoder and employed Low-Rank Adaptation (LoRA) techniques for the LLM. Experiments on two datasets show that both automated metrics and expert evaluations demonstrate the superior performance of our model, and zero-shot prediction results highlight its generalization ability. The models and codes are available at https://github.com/ wangzc1233/Prot2Chat. Contact: zqcao@suda.edu.cn or wangzc025@163.com Key words: Protein Q&A, Early-Fusion, LLM
LGAug 17, 2025
ProtTeX-CC: Activating In-Context Learning in Protein LLM via Two-Stage Instruction CompressionChuanliu Fan, Zicheng Ma, Jun Gao et al.
Recent advances in protein large language models, such as ProtTeX, represent both side-chain amino acids and backbone structure as discrete token sequences of residue length. While this design enables unified modeling of multimodal protein information, it suffers from two major limitations: (1) The concatenation of sequence and structure tokens approximately doubles the protein length and breaks the intrinsic residue-level alignment between modalities. (2) Constrained by the training corpus and limited context window, ProtTeX is typically trained on single-protein inputs, rendering it incompatible with in-context learning (ICL) and thus limiting its generalization capability. To address these issues, we propose ProtTeX-CC, a lightweight two-stage compression framework designed to enhance ProtTeX under few-shot settings. We first design a joint embedding compression mechanism that fuses sequence and structure representations at the residue level, effectively reducing the protein input length by half without sacrificing performance. Then we propose a self-compression module that aggregates each full demonstration into the latent space of the last few linguistic tokens, reducing the average demonstration length from 751 tokens to less than 16 tokens. Compared to the original ProtTeX, our self-compression approach achieves a compression ratio of approximately 93.68% in the total prompt length under the 16-shot setting. Without modifying the backbone model, ProtTeX-CC introduces only a small number of additional parameters through PEFT-based tuning in the joint embedding compression stage and a single trainable projection layer in the self-compression stage. Extensive experiments on protein function prediction show that ProtTeX-CC improves performance on the in-domain benchmark by 2%, and generalizes well to the out-of-domain dataset with a performance gain of 11%.