LG GN Mar 24

Central Dogma Transformer III: Interpretable AI Across DNA, RNA, and Protein

arXiv:2603.23361 9.0 h-index: 1

AI Analysis

This work addresses the need for interpretable AI in biology by providing a model that integrates molecular processes, enabling predictions and side effect screening without new experiments, though it is incremental in extending prior mechanism-oriented approaches.

The paper addresses the disconnect between the representations learned by biological AI models and the molecular processes they are meant to capture. It introduces CDT-III, a mechanism-oriented model spanning DNA, RNA, and protein, which achieves per-gene correlations of r=0.843 for RNA and r=0.969 for protein on five held-out genes and correctly predicts 29/29 protein changes in an in silico CD52 knockdown.

Biological AI models increasingly predict complex cellular responses, yet their learned representations remain disconnected from the molecular processes they aim to capture. We present CDT-III, which extends mechanism-oriented AI across the full central dogma: DNA, RNA, and protein. Its two-stage Virtual Cell Embedder architecture mirrors the spatial compartmentalization of the cell: VCE-N models transcription in the nucleus and VCE-C models translation in the cytosol. On five held-out genes, CDT-III achieves per-gene RNA r=0.843 and protein r=0.969. Adding protein prediction improves RNA performance (r=0.804 to 0.843), demonstrating that downstream tasks regularize upstream representations. Protein supervision sharpens DNA-level interpretability, increasing CTCF enrichment by 30%. Applied to in silico CD52 knockdown approximating Alemtuzumab, the model predicts 29/29 protein changes correctly and rediscovers 5 of 7 known clinical side effects without clinical data. Gradient-based side effect profiling requires only unperturbed baseline data (r=0.939), enabling screening of all 2,361 genes without new experiments.
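
The abstract reports "per-gene" correlations but does not say which coefficient is used or how scores are aggregated. As a minimal sketch, assuming Pearson correlation computed for each held-out gene across samples and then averaged, the metric could look like the following. The gene names, the toy values, and the helper names `pearson_r` and `per_gene_r` are hypothetical and not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Correlation is undefined when either sequence is constant.
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def per_gene_r(observed, predicted):
    """Map each gene to the correlation between observed and predicted values.

    Both arguments are dicts: gene name -> list of values across samples
    (assumed here to be perturbation conditions).
    """
    return {gene: pearson_r(observed[gene], predicted[gene]) for gene in observed}


# Toy data with hypothetical gene names; real inputs would be held-out genes.
observed = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [2.0, 1.0, 4.0, 3.0]}
predicted = {"GENE_A": [1.1, 1.9, 3.2, 3.8], "GENE_B": [4.0, 2.0, 8.0, 6.0]}

scores = per_gene_r(observed, predicted)
# Assumed aggregation: unweighted mean over genes, skipping undefined scores.
valid = [r for r in scores.values() if not math.isnan(r)]
mean_r = sum(valid) / len(valid)
```

Scoring each gene separately before averaging keeps a few highly variable genes from dominating the summary, which is one plausible reason to report the metric per gene.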
