LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology
This work addresses the need for a comprehensive overview and standardization in language-driven single-cell biology research. Its contribution is incremental: it synthesizes existing methods rather than introducing new ones.
The paper addresses fragmented progress in applying large language models and agentic frameworks to single-cell biology. It presents the first unified survey of 58 models spanning RNA, ATAC, multi-omic, and spatial modalities, evaluates them across 10 domain dimensions, draws on over 40 public datasets, and outlines open challenges.
Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods by family (foundation, text-bridge, spatial, multimodal, epigenomic, and agentic) and map them to eight key analytical tasks, including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.