Evolution Is All You Need: Phylogenetic Augmentation for Contrastive Learning
This paper addresses the lack of biologically informed self-supervised learning methods in bioinformatics. Such methods could give researchers more effective sequence representations.
This paper proposes a new perspective on self-supervised representation learning for biological sequences. It suggests treating evolution as a natural form of sequence augmentation and maximizing information across phylogenetic "noisy channels," with the aim of improving how encoders for biological sequence embeddings are pretrained.
Self-supervised representation learning of biological sequence embeddings alleviates computational resource constraints on downstream tasks while circumventing expensive experimental label acquisition. However, existing methods mostly borrow directly from large language models designed for NLP, rather than being designed with bioinformatics philosophies in mind. Recently, contrastive mutual information maximization methods have achieved state-of-the-art representations on ImageNet. In this perspective piece, we discuss how viewing evolution as natural sequence augmentation and maximizing information across phylogenetic "noisy channels" is a biologically and theoretically desirable objective for pretraining encoders. We first review the current contrastive learning literature. We then provide an illustrative example showing that contrastive learning with evolutionary augmentation can serve as a representation learning objective that maximizes the mutual information between biological sequences and their conserved function. Finally, we outline the rationale for this approach.
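To make the contrastive objective concrete, the following is a minimal sketch of an InfoNCE-style loss in which embeddings of homologous sequences act as positive pairs and the other sequences in the batch act as negatives. It is an illustration under stated assumptions, not the authors' implementation: the choice of InfoNCE, the function names `info_nce_loss` and `mi_lower_bound`, the temperature value, and the synthetic embeddings standing in for encoder outputs are all assumptions introduced here.

```python
import numpy as np


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE loss over a batch of paired embeddings.

    Row i of `positives` is the positive view for row i of `anchors`
    (for example, the embedding of a homolog); all other rows in
    `positives` serve as negatives for that anchor.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # shape (N, N)
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching (diagonal) entry as the target.
    return -np.mean(np.diag(log_probs))


def mi_lower_bound(loss, batch_size):
    """InfoNCE bound on mutual information: I(X; Y) >= log(N) - loss."""
    return np.log(batch_size) - loss


# Synthetic stand-ins for encoder outputs (hypothetical data, not real sequences).
rng = np.random.default_rng(0)
family_embeddings = rng.normal(size=(8, 16))
# "Homologs": the same underlying signal perturbed by small noise.
homolog_embeddings = family_embeddings + 0.05 * rng.normal(size=(8, 16))
# Unrelated sequences: independent embeddings.
unrelated_embeddings = rng.normal(size=(8, 16))

aligned_loss = info_nce_loss(family_embeddings, homolog_embeddings)
random_loss = info_nce_loss(family_embeddings, unrelated_embeddings)
aligned_bound = mi_lower_bound(aligned_loss, batch_size=8)
```

In this toy setting, pairs that share an underlying signal yield a lower loss than independent pairs, and the resulting bound on mutual information can never exceed log N for a batch of N pairs. In an actual pretraining setup, the embeddings would come from a sequence encoder and the positive pairs from evolutionarily related sequences.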