BMLG Mar 2, 2022

Biological Sequence Design with GFlowNets

MILA
arXiv:2203.04115v3, 231 citations, h-index: 57
Originality: Incremental advance
AI Analysis

This work addresses the cost of wet-lab evaluations in biological sequence design. The advance is incremental, since it builds on existing GFlowNet and active learning techniques.

The paper tackles the design of biological sequences with desired properties. It proposes an active learning algorithm that combines GFlowNets with epistemic uncertainty estimation to generate diverse and informative candidates. Compared with existing methods, the resulting batches are more diverse and novel while still containing high-scoring candidates.

Design of de novo biological sequences with desired properties, such as protein and DNA sequences, often involves an active loop with several rounds of molecule ideation and expensive wet-lab evaluations. These experiments can consist of multiple stages, with increasing levels of precision and cost of evaluation, where candidates are filtered. This makes the diversity of proposed candidates a key consideration in the ideation phase. In this work, we propose an active learning algorithm leveraging epistemic uncertainty estimation and the recently proposed GFlowNets as a generator of diverse candidate solutions, with the objective of obtaining a diverse batch of useful (as defined by some utility function, for example, the predicted anti-microbial activity of a peptide) and informative candidates after each round. We also propose a scheme to incorporate existing labeled datasets of candidates, in addition to a reward function, to speed up learning in GFlowNets. We present empirical results on several biological sequence design tasks, and we find that our method generates more diverse and novel batches with high-scoring candidates compared to existing approaches.
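
The loop described in the abstract (fit a proxy with uncertainty, generate a diverse batch, evaluate it, repeat) can be illustrated with a small toy sketch. This is not the authors' implementation, and every name in it is hypothetical. The assumptions: the "wet lab" is a toy oracle that scores similarity to a hidden target string, epistemic uncertainty is approximated by the spread of a bootstrap ensemble of simple additive models, the acquisition function is an upper confidence bound, and the GFlowNet is replaced by a plain stand-in that samples from a random candidate pool with probability proportional to reward. The stand-in only mimics the property that candidates are drawn in proportion to reward rather than by taking the single best one.

```python
import math
import random
import statistics

ALPHABET = "ACGT"
SEQ_LEN = 8
TARGET = "ACGTACGT"  # hypothetical hidden optimum, known only to the toy oracle


def random_seq(rng):
    return "".join(rng.choice(ALPHABET) for _ in range(SEQ_LEN))


def oracle(seq):
    """Toy stand-in for a wet-lab assay: fraction of positions matching TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET)) / SEQ_LEN


def fit_member(data, rng):
    """Fit one ensemble member on a bootstrap resample of the labeled data.

    The member is an additive model: for each (position, letter) pair it
    stores the mean label of resampled sequences with that letter there.
    """
    sample = [rng.choice(data) for _ in data]
    sums, counts = {}, {}
    for seq, y in sample:
        for key in enumerate(seq):
            sums[key] = sums.get(key, 0.0) + y
            counts[key] = counts.get(key, 0) + 1
    prior = statistics.fmean(y for _, y in sample)
    return {k: sums[k] / counts[k] for k in sums}, prior


def predict(member, seq):
    table, prior = member
    return statistics.fmean(table.get(key, prior) for key in enumerate(seq))


def acquisition(ensemble, seq, kappa=1.0):
    """Upper confidence bound: ensemble mean plus kappa times ensemble spread."""
    preds = [predict(m, seq) for m in ensemble]
    return statistics.fmean(preds) + kappa * statistics.pstdev(preds)


def propose_batch(ensemble, batch_size, rng, seen, pool_size=500, beta=20.0):
    """Stand-in for the generator (NOT a GFlowNet): draw distinct candidates
    from a random pool with probability proportional to exp(beta * acquisition)."""
    pool = sorted({random_seq(rng) for _ in range(pool_size)} - set(seen))
    weights = [math.exp(beta * acquisition(ensemble, s)) for s in pool]
    batch = []
    while len(batch) < batch_size and pool:
        i = rng.choices(range(len(pool)), weights=weights)[0]
        batch.append(pool.pop(i))  # sample without replacement
        weights.pop(i)
    return batch


def active_learning(rounds=5, batch_size=16, n_members=5, seed=0):
    rng = random.Random(seed)
    initial = sorted({random_seq(rng) for _ in range(32)})
    data = [(s, oracle(s)) for s in initial]
    history = [max(y for _, y in data)]  # best score seen after each round
    for _ in range(rounds):
        ensemble = [fit_member(data, rng) for _ in range(n_members)]
        batch = propose_batch(ensemble, batch_size, rng, seen=[s for s, _ in data])
        data += [(s, oracle(s)) for s in batch]  # the "expensive" evaluation
        history.append(max(y for _, y in data))
    return data, history


if __name__ == "__main__":
    _, history = active_learning()
    print("best score after each round:", history)
```

Sampling in proportion to reward, rather than taking the top-scoring candidates, is what keeps each batch spread over several promising regions. The `kappa` term pushes the batch toward sequences on which the ensemble disagrees, which is one simple way to make candidates informative as well as useful.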

Code Implementations: 1 repo
