BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model
BioReason targets a bottleneck for AI in biology: current models struggle with multi-step reasoning and do not give transparent explanations, which limits scientific progress. It offers a framework for interpretable, mechanistic AI in biology.
BioReason enables deep, interpretable biological reasoning over genomic data by integrating a DNA foundation model with a large language model. It raises KEGG-based disease pathway prediction accuracy from 86% to 98% and improves variant effect prediction by an average of 15% over baselines.
Unlocking deep and interpretable biological reasoning from complex genomic data remains a major AI challenge limiting scientific progress. While current DNA foundation models excel at representing sequences, they struggle with multi-step reasoning and lack transparent, biologically meaningful explanations. BioReason addresses this by tightly integrating a DNA foundation model with a large language model (LLM), enabling the LLM to directly interpret and reason over genomic information. Through supervised fine-tuning and reinforcement learning, BioReason learns to produce logical, biologically coherent deductions. It achieves major performance gains, boosting KEGG-based disease pathway prediction accuracy from 86% to 98% and improving variant effect prediction by an average of 15% over strong baselines. BioReason can reason over unseen biological entities and explain its decisions step by step, offering a transformative framework for interpretable, mechanistic AI in biology. All data, code, and checkpoints are available at https://github.com/bowang-lab/BioReason
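The integration described above can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: it assumes the DNA foundation model emits one embedding per sequence token, that a linear projection maps those embeddings to the LLM's embedding width, and that the projected vectors are spliced into the LLM's input between the text embeddings. The function names, the dimensions, and the random stand-in values are all hypothetical; the released code at the repository above is the reference.

```python
import random

# Hypothetical sizes, chosen only to keep the example small.
D_DNA, D_LLM = 4, 6

rng = random.Random(0)


def random_vectors(count, width):
    """Stand-in for model outputs: `count` random vectors of length `width`."""
    return [[rng.uniform(-1, 1) for _ in range(width)] for _ in range(count)]


def project(dna_embeddings, weight, bias):
    """Linear map from DNA-encoder width to LLM width.

    `weight` has one row per output dimension, so each input vector
    yields one output vector of length len(weight).
    """
    return [
        [sum(x * w for x, w in zip(vec, row)) + b for row, b in zip(weight, bias)]
        for vec in dna_embeddings
    ]


def build_multimodal_input(prefix, dna_projected, suffix):
    """Splice the projected DNA vectors between the text embeddings."""
    return prefix + dna_projected + suffix


# Assumed inputs: 3 DNA token embeddings, 2 prompt tokens before and 5 after.
dna_embeddings = random_vectors(3, D_DNA)
weight = random_vectors(D_LLM, D_DNA)  # would be learned in a real system
bias = [0.0] * D_LLM
prefix = random_vectors(2, D_LLM)
suffix = random_vectors(5, D_LLM)

dna_projected = project(dna_embeddings, weight, bias)
sequence = build_multimodal_input(prefix, dna_projected, suffix)
print(len(sequence), len(sequence[0]))  # prints "10 6"
```

Under these assumptions the LLM would see a single sequence of 10 vectors, all of width 6, in which positions 2 to 4 carry genomic rather than textual information. In a trained system the projection weights would be learned, for example during the supervised fine-tuning stage mentioned above.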