GN LG Aug 6, 2025

Embedding Is (Almost) All You Need: Retrieval-Augmented Inference for Generalizable Genomic Prediction Tasks

Nirjhor Datta, Swakkhar Shatabda, M Sohel Rahman

arXiv:2508.04757v1 h-index: 29 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for more generalizable and efficient deployment of genomic AI models, especially in diverse or unseen contexts. It is an incremental advance because it builds on existing embedding methods.

The paper tackles the cost of fine-tuning large pre-trained DNA language models for genomic prediction tasks. It shows that embedding-based pipelines with lightweight classifiers achieve competitive performance and often outperform fine-tuning when training and test data come from different distributions, while reducing inference time by 10x to 20x and improving carbon efficiency by over 10x.

Large pre-trained DNA language models such as DNABERT-2, Nucleotide Transformer, and HyenaDNA have demonstrated strong performance on various genomic benchmarks. However, most applications rely on expensive fine-tuning, which works best when the training and test data share a similar distribution. In this work, we investigate whether task-specific fine-tuning is always necessary. We show that simple embedding-based pipelines that extract fixed representations from these models and feed them into lightweight classifiers can achieve competitive performance. In evaluation settings with different data distributions, embedding-based methods often outperform fine-tuning while reducing inference time by 10x to 20x. Our results suggest that embedding extraction is not only a strong baseline but also a more generalizable and efficient alternative to fine-tuning, especially for deployment in diverse or unseen genomic contexts. For example, in enhancer classification, HyenaDNA embeddings combined with zCurve achieve 0.68 accuracy (vs. 0.58 for fine-tuning), with an 88% reduction in inference time and over 8x lower carbon emissions (0.02 kg vs. 0.17 kg CO2). In non-TATA promoter classification, DNABERT-2 embeddings with zCurve or GC content reach 0.85 accuracy (vs. 0.89 with fine-tuning) with a 22x lower carbon footprint (0.02 kg vs. 0.44 kg CO2). These results show that embedding-based pipelines offer over 10x better carbon efficiency while maintaining strong predictive performance. The code is available here: https://github.com/NIRJHOR-DATTA/EMBEDDING-IS-ALMOST-ALL-YOU-NEED.
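
As a rough illustration of the kind of pipeline the abstract describes (fixed embeddings plus hand-crafted features such as GC content and zCurve, fed to a lightweight classifier), the following is a minimal sketch and not the authors' implementation. Everything named here is an assumption: `toy_embedding` is a hypothetical stand-in for a frozen DNABERT-2 or HyenaDNA embedding, the zCurve feature uses the length-normalised end point of the standard Z-curve walk (the abstract does not say which variant the paper uses), and `NearestCentroid` is a simple placeholder for whatever lightweight classifier is actually chosen.

```python
from collections import Counter
from math import dist


def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def z_curve(seq):
    """Length-normalised end point of the Z-curve walk (assumed variant)."""
    c = Counter(seq.upper())
    n = max(len(seq), 1)
    x = (c["A"] + c["G"]) - (c["C"] + c["T"])  # purine vs. pyrimidine
    y = (c["A"] + c["C"]) - (c["G"] + c["T"])  # amino vs. keto
    z = (c["A"] + c["T"]) - (c["C"] + c["G"])  # weak vs. strong H-bonding
    return (x / n, y / n, z / n)


def toy_embedding(seq):
    """HYPOTHETICAL stand-in for a frozen DNA language model embedding.

    A real pipeline would return the pooled hidden state of a pre-trained
    model; base frequencies are used here only to keep the sketch runnable.
    """
    c = Counter(seq.upper())
    n = max(len(seq), 1)
    return [c[b] / n for b in "ACGT"]


def featurize(seq):
    """Concatenate the (placeholder) embedding with GC content and zCurve."""
    return toy_embedding(seq) + [gc_content(seq)] + list(z_curve(seq))


class NearestCentroid:
    """Tiny placeholder classifier: predict the label of the closest class mean."""

    def fit(self, X, y):
        groups = {}
        for row, label in zip(X, y):
            groups.setdefault(label, []).append(row)
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, X):
        return [
            min(self.centroids, key=lambda lab: dist(row, self.centroids[lab]))
            for row in X
        ]


# Toy data: label 1 = GC-rich, label 0 = AT-rich (illustrative only).
train_seqs = ["GGCCGGCC", "GCGCGCGG", "ATATATAA", "TTAATTAA"]
train_labels = [1, 1, 0, 0]
clf = NearestCentroid().fit([featurize(s) for s in train_seqs], train_labels)
predictions = clf.predict([featurize(s) for s in ["GGGCCCGC", "AATTATAT"]])
```

No language model weights are updated at any point: only the small classifier is fit, which is the property the abstract credits for the lower inference time and carbon cost relative to fine-tuning.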
