Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction
This work addresses the problem of predicting microbial community properties from the raw DNA sequences of community members, using the few-shot learning capabilities of genomic language models to improve generalization to novel genomes.
The authors propose SAGE, a set-aggregated genome embedding method using genomic language models, to predict microbiome abundance profiles. They show improved generalization on novel genomes compared to classical bioinformatics approaches, with model ablation confirming that community-level latent representations directly improve performance.
Microbiome functions are encoded within the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted from the raw DNA sequences of its members alone. In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). We benchmark this approach and show improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and compare GLM embedding choices.
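To make the idea of set aggregation concrete, the following is a minimal NumPy sketch, not the SAGE architecture itself. It assumes that each genome has already been mapped to a fixed-length GLM embedding (random vectors stand in for these here), that the intermediate transformation is a single tanh layer, that the community-level representation is a mean over the set, and that abundances are relative and obtained with a softmax. The function name `predict_abundances` and the weights `w_in`, `w_mix`, and `w_out` are hypothetical and untrained.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def predict_abundances(genome_embeddings, w_in, w_mix, w_out):
    """Map a set of genome embeddings to relative abundances.

    genome_embeddings: (n_genomes, d) array, one row per community member.
    All weights are hypothetical placeholders, not trained parameters.
    """
    # Intermediate transformation applied to each genome independently.
    h = np.tanh(genome_embeddings @ w_in)            # (n, k)
    # Community-level latent representation: mean pooling over the set,
    # which does not depend on the order of the genomes.
    community = h.mean(axis=0)                       # (k,)
    # Give every genome access to the community context.
    context = np.broadcast_to(community, h.shape)    # (n, k)
    features = np.concatenate([h, context], axis=1)  # (n, 2k)
    # Nonlinear mixing, so the shared context can change the relative scores
    # instead of adding the same constant to every genome.
    hidden = np.tanh(features @ w_mix)               # (n, k)
    scores = hidden @ w_out                          # (n,)
    return softmax(scores)


rng = np.random.default_rng(0)
n_genomes, d, k = 5, 16, 8
embeddings = rng.normal(size=(n_genomes, d))  # stand-in for GLM embeddings
w_in = rng.normal(size=(d, k)) / np.sqrt(d)
w_mix = rng.normal(size=(2 * k, k)) / np.sqrt(2 * k)
w_out = rng.normal(size=k)

abundances = predict_abundances(embeddings, w_in, w_mix, w_out)
```

Because the pooling step is a mean over the set, reordering the input genomes only reorders the predicted abundances, and the same function accepts communities of any size. Dropping the `context` columns gives a per-genome baseline of the kind an ablation of the community-level representation would compare against.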