DNAChunker: Learnable Tokenization for DNA Language Models
For researchers using DNA language models, this work addresses the brittleness of fixed tokenization under genomic variation, offering a more robust representation.
DNAChunker introduces a learnable adaptive segmentation module for DNA language models, producing context-dependent, variable-length tokens. Pretrained on the human reference genome, it outperforms fixed-tokenization baselines across five benchmarks.
DNA language models are increasingly used to represent genomic sequence, yet their effectiveness depends critically on how raw nucleotides are converted into model inputs. Unlike natural language, DNA offers no canonical boundaries, making fixed tokenizations a brittle design choice under shifts, indels, and local repeats. We introduce DNAChunker, a masked DNA language model that incorporates a learnable adaptive segmentation module to produce context-dependent, variable-length units. Building on a dynamic segmentation procedure, DNAChunker learns to allocate finer granularity to functionally enriched regions while compressing repetitive or redundant sequence. We pretrain DNAChunker on the human reference genome and evaluate it across five benchmarks, where it consistently improves over strong fixed-tokenization baselines. Further analyses and ablations indicate that, unlike fixed tokenization, the learned segmentation is biologically informed and resilient to mutations.
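The brittleness of fixed tokenization under indels can be illustrated with a small sketch. It does not implement DNAChunker or its segmentation module; it only shows, assuming a plain non-overlapping 3-mer tokenizer, how a single inserted base shifts every downstream token. The sequence, the insertion site, and the helper names `kmer_tokens` and `count_matching` are hypothetical and chosen for illustration.

```python
def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Split seq into non-overlapping k-mers; a trailing remainder becomes a shorter token."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def count_matching(a: list[str], b: list[str]) -> int:
    """Count positions at which both token lists hold the same token."""
    return sum(x == y for x, y in zip(a, b))


# Hypothetical reference sequence (18 nt) and a variant with one inserted base.
reference = "ATGCGTACCTGAAGTCCA"
variant = reference[:4] + "G" + reference[4:]  # insert 'G' after the 4th base

ref_tokens = kmer_tokens(reference)
var_tokens = kmer_tokens(variant)
matching = count_matching(ref_tokens, var_tokens)

print(ref_tokens)  # ['ATG', 'CGT', 'ACC', 'TGA', 'AGT', 'CCA']
print(var_tokens)  # ['ATG', 'CGG', 'TAC', 'CTG', 'AAG', 'TCC', 'A']
print(matching)    # 1
```

Only the first token survives the insertion, although 18 of the 19 bases in the variant are unchanged. A segmentation whose boundaries depend on local context rather than on absolute position could, in principle, realign after the inserted base, which is the robustness property the abstract attributes to learned segmentation.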