GN AI CE CL Feb 13, 2024

DNABERT-S: Pioneering Species Differentiation with Species-Aware DNA Embeddings

Zhihan Zhou, Weimin Wu, Harrison Ho, Jiayi Wang, Lizhen Shi, Ramana V Davuluri, Zhong Wang, Han Liu

arXiv:2402.08777v312.245 citations | h-index: 62 | Has Code | Bioinform.

Originality: Incremental advance

AI Analysis

This work addresses species differentiation in genomics. It is an incremental advance: it builds on an existing genome foundation model and adds new training strategies.

The paper tackles the challenge of differentiating species from genomic sequences, especially when reference genomes are unavailable. It introduces DNABERT-S, a model whose species-aware embeddings cluster DNA sequences by species. Reported results include identifying twice as many species from unlabeled mixtures and doubling the Adjusted Rand Index in clustering.
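For readers unfamiliar with the Adjusted Rand Index, the following is a minimal sketch of how it can be computed from two labelings of the same sequences. It uses the standard pair-counting formula and only the Python standard library. The function name `adjusted_rand_index` and the toy species labels are illustrative assumptions, not taken from the paper or its code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting Adjusted Rand Index between two labelings.

    1.0 means the two partitions agree exactly (up to label names);
    values near 0.0 are what random assignment would give on average.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree about

    # Pairs of items that share a cell of the contingency table,
    # a true class, or a predicted cluster, respectively.
    same_cell = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_true * same_pred / total_pairs
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item in one group in both labelings
    return (same_cell - expected) / (max_index - expected)


# Hypothetical toy data: four sequences from two species.
true_species = ["species_A", "species_A", "species_B", "species_B"]
perfect_clusters = [1, 1, 0, 0]    # same grouping, different label names
crossed_clusters = [0, 1, 0, 1]    # every cluster mixes both species

ari_perfect = adjusted_rand_index(true_species, perfect_clusters)
ari_crossed = adjusted_rand_index(true_species, crossed_clusters)
```

Because the index is corrected for chance, a clustering that systematically splits species across clusters (as `crossed_clusters` does) scores below zero rather than merely low.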

We introduce DNABERT-S, a tailored genome model that develops species-aware embeddings to naturally cluster and segregate DNA sequences of different species in the embedding space. Differentiating species from genomic sequences (i.e., DNA and RNA) is vital yet challenging, since many real-world species remain uncharacterized and lack known reference genomes. Embedding-based methods are therefore used to differentiate species in an unsupervised manner. DNABERT-S builds upon a pre-trained genome foundation model named DNABERT-2. To encourage effective embeddings for error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C$^2$LR) strategy. Empirical results on 23 diverse datasets show DNABERT-S's effectiveness, especially in realistic label-scarce scenarios. For example, it identifies twice as many species from a mixture of unlabeled genomic sequences, doubles the Adjusted Rand Index (ARI) in species clustering, and outperforms the top baseline's 10-shot species classification performance with just 2-shot training. Model, code, and data are publicly available at https://github.com/MAGICS-LAB/DNABERT_S.
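The MI-Mix idea described in the abstract (mix hidden states at a randomly chosen layer, then ask the output to reflect the mixing proportions) can be illustrated with a toy sketch. This is not the authors' implementation: the tiny tanh encoder, the cosine-similarity softmax loss, the Beta-sampled mixing weight, and every name below (`encode`, `mi_mix_loss`, `lam`, `mix_at`, `perm`) are assumptions chosen only to make the mechanism concrete in plain Python.

```python
import math
import random


def layer_forward(vec, weights):
    """One toy encoder layer: dense map followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in weights]


def encode(vec, layers, mix_at=None, partner=None, lam=1.0):
    """Run vec through all layers. If mix_at is set, blend the hidden state
    with the partner's hidden state just before layer index mix_at."""
    h, p = vec, partner
    for idx, weights in enumerate(layers):
        if p is not None and idx == mix_at:
            h = [lam * a + (1.0 - lam) * b for a, b in zip(h, p)]
            p = None  # after mixing, only the blended state is propagated
        h = layer_forward(h, weights)
        if p is not None:
            p = layer_forward(p, weights)
    return h


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def mi_mix_loss(anchors, positives, layers, lam, mix_at, perm, temperature=0.1):
    """Contrastive loss on mixed anchors. Anchor i is mixed with anchor perm[i];
    the target is a soft label: lam on positive i, (1 - lam) on positive perm[i]."""
    pos_emb = [encode(y, layers) for y in positives]
    total = 0.0
    for i, x in enumerate(anchors):
        j = perm[i]
        mixed = encode(x, layers, mix_at=mix_at, partner=anchors[j], lam=lam)
        logits = [cosine(mixed, e) / temperature for e in pos_emb]
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total -= lam * (logits[i] - log_z) + (1.0 - lam) * (logits[j] - log_z)
    return total / len(anchors)


# Hypothetical toy setup: random vectors stand in for sequence representations.
rng = random.Random(0)
dim, n_layers, batch = 4, 3, 4
layers = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
          for _ in range(n_layers)]
anchors = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(batch)]
positives = [[a + rng.gauss(0, 0.05) for a in x] for x in anchors]  # noisy "views"

perm = list(range(batch))
rng.shuffle(perm)                   # which anchor each anchor is mixed with
lam = rng.betavariate(1.0, 1.0)     # mixing proportion in [0, 1]
mix_at = rng.randrange(n_layers)    # randomly selected layer to mix at

loss = mi_mix_loss(anchors, positives, layers, lam, mix_at, perm)
```

With `lam = 1.0` the mixing is a no-op and the loss reduces to an ordinary instance-level contrastive loss, which is one way to sanity-check a sketch like this.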
