Evaluating DNA function understanding in genomic language models using evolutionarily implausible sequences
This work addresses a critical limitation for synthetic biology applications, where models must generalize to engineered sequences. The findings are incremental, however, because they highlight existing issues rather than propose a new solution.
The paper addresses the failure of genomic language models (gLMs) to understand DNA function beyond evolutionary patterns. It introduces the Nullsettes benchmark, which tests whether models can predict loss-of-function mutations in synthetic sequences. Most of the 12 state-of-the-art gLMs tested failed to consistently detect these mutations, and accuracy dropped sharply as the likelihood assigned to the original sequence decreased.
Genomic language models (gLMs) hold promise for generating novel, functional DNA sequences for synthetic biology. However, realizing this potential requires models to go beyond evolutionary plausibility and understand how DNA sequence encodes gene expression and regulation. We introduce a benchmark called Nullsettes, which assesses how well models can predict in silico loss-of-function (LOF) mutations in synthetic expression cassettes with little evolutionary precedent. Testing 12 state-of-the-art gLMs, we find that most fail to consistently detect these strong LOF mutations. All models show a sharp drop in predictive accuracy as the likelihood assigned to the original (nonmutant) sequence decreases, suggesting that gLMs rely heavily on pattern-matching to their evolutionary prior rather than on any mechanistic understanding of gene expression. Our findings highlight fundamental limitations in how gLMs generalize to engineered, non-natural sequences, and underscore the need for benchmarks and modeling strategies that prioritize functional understanding.
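As a rough illustration of the kind of analysis the abstract describes, the sketch below shows one plausible way to relate LOF detection to the likelihood a model assigns to the original sequence. It is not the Nullsettes protocol. The function names, the detection criterion (the mutant scores below the original), the equal-size binning, and the toy log-likelihood values are all assumptions made for illustration; in practice the scores would come from a gLM.

```python
from typing import List, Optional, Tuple

# Each pair holds (log-likelihood of original cassette, log-likelihood of LOF mutant).
ScorePair = Tuple[float, float]


def detects_lof(ll_original: float, ll_mutant: float) -> bool:
    """Assumed criterion: the model 'detects' the LOF mutation when it
    assigns the mutant a strictly lower log-likelihood than the original."""
    return ll_mutant < ll_original


def accuracy_by_likelihood_bin(
    pairs: List[ScorePair], n_bins: int = 2
) -> List[Optional[float]]:
    """Sort pairs by the original sequence's log-likelihood (lowest first),
    split them into n_bins roughly equal groups, and return the fraction of
    detected LOF mutations in each group (None for an empty group)."""
    ranked = sorted(pairs, key=lambda p: p[0])
    n = len(ranked)
    accuracies: List[Optional[float]] = []
    for i in range(n_bins):
        chunk = ranked[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            accuracies.append(None)
            continue
        hits = sum(detects_lof(orig, mut) for orig, mut in chunk)
        accuracies.append(hits / len(chunk))
    return accuracies


# Hypothetical scores, invented purely to exercise the functions.
toy_pairs = [(-120.0, -119.0), (-118.0, -121.0), (-60.0, -75.0), (-55.0, -70.0)]
print(accuracy_by_likelihood_bin(toy_pairs))  # [0.5, 1.0]
```

With these made-up scores, the low-likelihood bin has lower detection accuracy than the high-likelihood bin, which is the shape of the trend the abstract reports. The numbers themselves carry no empirical meaning.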