CY · May 14

GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction

arXiv:2605.14442
AI Analysis

This work addresses the genotype-to-physiology gap for microbial strains, offering a unified computational approach that reduces reliance on exhaustive in vitro screening.

GGBound introduces a genome-conditioned, tool-augmented LLM agent for predicting microbial life boundaries (e.g., temperature, pH, salinity) from genome sequences, achieving performance matching or surpassing larger frontier LLMs on a curated benchmark of 1,525 strains and 6,448 instances.

Characterizing the physiological life boundaries of microbial strains, including viable temperature, pH, salinity, substrate utilization, and morphology, is central to biotechnology and ecology, yet traditionally requires exhaustive in vitro screening. Existing computational approaches either treat physiological traits as isolated supervised targets or repurpose biological foundation models as static encoders, leaving the genotype-to-physiology gap largely unbridged. We formulate microbial life-boundary prediction as a unified genome-to-physiology task and address it with a genome-conditioned, tool-augmented LLM agent. To support this task, we curate a strain-centric benchmark from IJSEM, NCBI, and BacDive covering 1,525 strains and 6,448 instances across viability intervals, environmental optima, substrate utilization, categorical traits, and morphology. Architecturally, the agent injects frozen LucaOne genome embeddings into a Qwen backbone via lightweight token fusion, and reasons over a similarity-based RAG module and a Genome-scale Metabolic Model (GEM) perturbation tool. We optimize the agent through a three-stage pipeline of gene-text alignment, agentic SFT on distilled trajectories, and GRPO with a novel counterfactual gene-grounding reward that reinforces the policy only when the authentic genome embedding causally improves correct-token generation relative to a zero-gene ablation. The resulting 4B-parameter agent matches or surpasses substantially larger frontier LLMs, with ablations confirming that genome-token fusion, dynamic tool use, and the counterfactual reward each yield distinct, significant gains.
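The counterfactual gene-grounding reward described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function name, the use of mean per-token log-probabilities, the `margin` threshold, and the fixed `bonus` value are all hypothetical. The sketch only captures the stated idea that the policy is rewarded when the real genome embedding raises the likelihood of the correct tokens relative to a zero-gene ablation.

```python
from typing import Sequence


def counterfactual_gene_reward(
    logp_with_gene: Sequence[float],
    logp_zero_gene: Sequence[float],
    answer_correct: bool,
    margin: float = 0.0,
    bonus: float = 1.0,
) -> float:
    """Hypothetical gene-grounding reward (not the paper's exact formula).

    logp_with_gene: log-probabilities of the correct answer tokens when the
        model is conditioned on the authentic genome embedding.
    logp_zero_gene: log-probabilities of the same tokens when the genome
        embedding is replaced by zeros (the ablation).
    answer_correct: whether the sampled answer was judged correct.
    margin, bonus: assumed hyperparameters, introduced only for this sketch.
    """
    if not logp_with_gene or len(logp_with_gene) != len(logp_zero_gene):
        raise ValueError("log-prob sequences must be non-empty and equal length")

    # No grounding reward for incorrect answers.
    if not answer_correct:
        return 0.0

    # Mean per-token log-likelihood gain attributable to the real genome.
    gain = sum(w - z for w, z in zip(logp_with_gene, logp_zero_gene)) / len(
        logp_with_gene
    )

    # Reward only when the genome helps by more than the margin.
    return bonus if gain > margin else 0.0


if __name__ == "__main__":
    # The real genome makes the correct tokens more likely, so the bonus is paid.
    print(counterfactual_gene_reward([-0.1, -0.2], [-0.5, -0.9], True))
    # The genome does not help, so no bonus is paid even for a correct answer.
    print(counterfactual_gene_reward([-0.8, -0.9], [-0.5, -0.6], True))
```

In a GRPO setup, a term like this would presumably be added to the task reward for each sampled trajectory before group-relative advantages are computed; how the paper actually combines the terms is not stated in the abstract.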
