GN · LG · Jan 13, 2025

Multi-megabase scale genome interpretation with genetic language models

arXiv:2501.07737v1 · 3 citations · h-index: 16
Originality: Highly original
AI Analysis

This work addresses genome interpretation for disease understanding and risk prediction in genetics and medicine. It proposes a novel method for a known bottleneck.

The authors tackle the challenge of interpreting large-scale genome sequences to understand disease mechanisms. They develop Phenformer, a genetic language model that generates mechanistic hypotheses from DNA sequences of up to 88 million base pairs, and show that these hypotheses match the literature better than those of existing methods and that the model improves disease risk prediction.

Understanding how molecular changes caused by genetic variation drive disease risk is crucial for deciphering disease mechanisms. However, interpreting genome sequences is challenging because of the vast size of the human genome, and because its consequences manifest across a wide range of cells, tissues and scales -- spanning from molecular to whole organism level. Here, we present Phenformer, a multi-scale genetic language model that learns to generate mechanistic hypotheses as to how differences in genome sequence lead to disease-relevant changes in expression across cell types and tissues directly from DNA sequences of up to 88 million base pairs. Using whole genome sequencing data from more than 150 000 individuals, we show that Phenformer generates mechanistic hypotheses about disease-relevant cell and tissue types that match literature better than existing state-of-the-art methods, while using only sequence data. Furthermore, disease risk predictors enriched by Phenformer show improved prediction performance and generalisation to diverse populations. Accurate multi-megabase scale interpretation of whole genomes without additional experimental data enables both a deeper understanding of molecular mechanisms involved in disease and improved disease risk prediction at the level of individuals.
