M5: A Whole Genome Bacterial Encoder at Single Nucleotide Resolution
This work addresses the computational bottleneck that researchers face when analyzing whole genomes with transformers, though it is incremental in that it builds on existing transformer and linear attention methods.
The authors tackled the challenge of scaling transformer models to handle multi-million nucleotide bacterial genomes by introducing a linear attention mechanism, achieving stable performance up to 2 million nucleotides during testing on a single GPU.
A linear attention mechanism is described that extends the context length of an encoder-only transformer, called M5 in this report, to produce a foundation model pretrained on bacterial whole genomes at single-nucleotide resolution over multi-million-nucleotide contexts. The linear attention mechanism closely approximates full quadratic attention and has a simple, lightweight implementation when the key-query embedding dimensionality is low. The M5-small model is trained and tested entirely on one A100 GPU with 40 GB of memory, on sequences of up to 196K nucleotides during training and 2M nucleotides during testing. We test the M5-small model and record notable improvements in performance as whole-genome bacterial sequence lengths increase, and we demonstrate that the approximation of full multi-head attention remains stable as sequence length increases.
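To make the idea of linear attention with a low key-query dimensionality concrete, the following is a minimal NumPy sketch. It is not taken from the report: it assumes one possible approximation, a second-order Taylor expansion of exp(q·k), and the names `taylor_features`, `linear_attention`, and `softmax_attention` are hypothetical. The feature map has size 1 + d + d², which is why a scheme of this kind is only practical when d is small.

```python
import numpy as np


def taylor_features(x):
    """Map (n, d) vectors to (n, 1 + d + d*d) features such that
    phi(q) . phi(k) == 1 + q.k + (q.k)**2 / 2 (assumed approximation)."""
    n, d = x.shape
    ones = np.ones((n, 1))
    # Flattened outer products give the second-order term.
    outer = np.einsum("ni,nj->nij", x, x).reshape(n, d * d) / np.sqrt(2.0)
    return np.concatenate([ones, x, outer], axis=1)


def linear_attention(q, k, v):
    """Bidirectional (encoder-style) attention in O(n) time and memory.

    The n x n score matrix is never formed: keys and values are first
    summarized into fixed-size statistics, then queried row by row.
    """
    phi_q = taylor_features(q)      # (n, D)
    phi_k = taylor_features(k)      # (n, D)
    kv = phi_k.T @ v                # (D, d_v), sum over all positions
    z = phi_k.sum(axis=0)           # (D,), normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]


def softmax_attention(q, k, v):
    """Reference quadratic attention, for comparison on short inputs."""
    scores = q @ k.T
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=1, keepdims=True)) @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_v = 64, 4, 8            # illustrative sizes, not the paper's
    q = 0.1 * rng.standard_normal((n, d))
    k = 0.1 * rng.standard_normal((n, d))
    v = rng.standard_normal((n, d_v))
    gap = np.abs(linear_attention(q, k, v) - softmax_attention(q, k, v)).max()
    print(f"max abs difference vs. softmax attention: {gap:.2e}")
```

Because 1 + x + x²/2 is positive for every real x, the normalizer in this sketch never reaches zero. For scale, a full attention matrix over 2M positions would hold 4 × 10¹² entries per head, whereas the summaries `kv` and `z` above have a size that does not depend on sequence length.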