LG · AI · QM · Aug 31, 2025

Why Pool When You Can Flow? Active Learning with GFlowNets

arXiv:2509.00704v1 · 1 citation · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the bottleneck of computational cost in active learning for drug discovery, offering a scalable solution for virtual screening with billions of samples, though it is incremental as it builds on existing methods like BALD and GFlowNets.

The paper tackles the computational scalability of pool-based active learning on large datasets, such as virtual screening libraries in drug discovery. It introduces BALD-GFlowNet, a generative framework that replaces traditional pool-based acquisition with GFlowNet sampling, achieving performance comparable to standard BALD while generating more structurally diverse molecules.

The scalability of pool-based active learning is limited by the computational cost of evaluating large unlabeled datasets, a challenge that is particularly acute in virtual screening for drug discovery. While active learning strategies such as Bayesian Active Learning by Disagreement (BALD) prioritize informative samples, they remain computationally intensive when scaled to libraries containing billions of samples. In this work, we introduce BALD-GFlowNet, a generative active learning framework that circumvents this issue. Our method leverages Generative Flow Networks (GFlowNets) to directly sample objects in proportion to the BALD reward. By replacing traditional pool-based acquisition with generative sampling, BALD-GFlowNet achieves scalability that is independent of the size of the unlabeled pool. In our virtual screening experiment, we show that BALD-GFlowNet achieves performance comparable to that of the standard BALD baseline while generating more structurally diverse molecules, offering a promising direction for efficient and scalable molecular discovery.
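
To make the BALD reward concrete, here is a minimal sketch of the standard BALD score (entropy of the mean prediction minus the mean per-model entropy), together with the reward-proportional distribution that a GFlowNet sampler would aim to match. It assumes a binary classifier and a small set of posterior samples (for example an ensemble); the candidate names, the probabilities, and the helper functions are hypothetical and are not taken from the paper. The explicit normalization over three toy candidates only illustrates the target distribution: the point of the generative approach is that a trained sampler would not need to enumerate the pool this way.

```python
import math


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) variable; defined as 0 at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bald_score(member_probs):
    """BALD = H[mean prediction] - mean of the per-member entropies.

    member_probs holds one positive-class probability per posterior sample.
    The result is clamped at 0 to absorb floating-point round-off.
    """
    n = len(member_probs)
    mean_p = sum(member_probs) / n
    expected_entropy = sum(binary_entropy(p) for p in member_probs) / n
    return max(0.0, binary_entropy(mean_p) - expected_entropy)


# Hypothetical predictions from 4 posterior samples for three toy candidates.
candidates = {
    "agree_confident": [0.95, 0.96, 0.94, 0.95],  # members agree, low entropy
    "agree_uncertain": [0.50, 0.50, 0.50, 0.50],  # members agree, high entropy
    "disagree": [0.05, 0.95, 0.10, 0.90],         # members disagree
}

scores = {name: bald_score(probs) for name, probs in candidates.items()}

# Target distribution for a reward-proportional sampler: P(x) = R(x) / sum of R.
total = sum(scores.values())
target = {name: score / total for name, score in scores.items()}

for name in candidates:
    print(f"{name}: BALD={scores[name]:.4f}, target P={target[name]:.4f}")
```

The candidate whose posterior samples disagree receives the largest score, whereas the candidate on which all samples agree at 0.5 scores roughly zero: BALD rewards disagreement between models, not predictive uncertainty alone.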

