GN, LG
Oct 14, 2025

Same model, better performance: the impact of shuffling on DNA Language Models benchmarking

arXiv:2510.12617v1
1 citation
h-index: 27

Originality: Synthesis-oriented

AI Analysis

This work addresses a critical issue for researchers in genomics and machine learning by exposing how standard ML practices can lead to unreliable benchmarks in specialized domains, though it is incremental in improving benchmarking methodology.

The study found that hardware-dependent data loading hyperparameters, namely the number of workers and the buffer size, cause spurious performance variations of up to 4% in DNA Language Model benchmarking, affecting both absolute scores and model rankings. The authors propose pre-shuffling the data before storage as a simple way to eliminate these dependencies while maintaining efficiency.

Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate the capabilities of DNA Language Models (DNA LMs). However, evaluating DNA LMs is a complex task that intersects genomics' domain-specific challenges and machine learning methodologies, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters (the number of data loading workers and the buffer size) create spurious performance variations of up to 4% for identical models. The problem stems from inadequate data shuffling interacting with domain-specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show that these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.
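The interaction between buffer size and data order described in the abstract can be illustrated with a toy simulation. The sketch below is not BEND's data loader: `buffer_shuffle`, `pre_shuffle`, and `mean_displacement` are hypothetical helpers, the integer list stands in for samples stored in genomic order, and the effect of multiple workers is not modeled. It only shows that a bounded shuffle buffer mixes ordered data in proportion to its size, whereas data shuffled once before storage is well mixed regardless of the buffer.

```python
import random


def buffer_shuffle(items, buffer_size, rng):
    """Streaming shuffle: fill a bounded buffer, then emit one random element per new item."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            idx = rng.randrange(len(buffer))
            # Swap the chosen element to the end so it can be popped in O(1).
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def pre_shuffle(items, seed):
    """Shuffle the whole dataset once, as would be done before writing it to storage."""
    out = list(items)
    random.Random(seed).shuffle(out)
    return out


def mean_displacement(order):
    """Average |output position - original index|; larger values mean better mixing."""
    return sum(abs(pos - idx) for pos, idx in enumerate(order)) / len(order)


N = 10_000
ordered = list(range(N))  # assumption: stands in for samples stored in genomic order

# Ordered storage: mixing quality depends on the (hardware-driven) buffer size.
small = list(buffer_shuffle(ordered, 10, random.Random(0)))
large = list(buffer_shuffle(ordered, 1000, random.Random(0)))

# Pre-shuffled storage: well mixed even when read through a tiny buffer.
stored = pre_shuffle(ordered, seed=0)
after_small = list(buffer_shuffle(stored, 10, random.Random(1)))

for name, order in [("ordered, buffer=10", small),
                    ("ordered, buffer=1000", large),
                    ("pre-shuffled, buffer=10", after_small)]:
    print(f"{name}: mean displacement = {mean_displacement(order):.1f}")
```

With ordered storage, samples stay close to their original neighbours when the buffer is small, so consecutive batches are drawn from nearby regions. After pre-shuffling, the same small buffer reads an already mixed stream, which is why the result no longer depends on the buffer setting.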
