LG | Jun 20, 2025

Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning

Marco Jiralerspong, Esther Derman, Danilo Vucetic, Nikolay Malkin, Bilun Sun, Tianyu Zhang, Pierre-Luc Bacon, Gauthier Gidel

arXiv:2506.17007v2 | 9.42 citations | h-index: 31 | Has Code

Originality: Incremental advance

AI Analysis

This work targets a bottleneck in scientific discovery: generating promising candidates from a very large search space. It appears incremental, since it builds on existing RL methods and adds a new operator and a robustness perspective.

The paper studies how to generate diverse, high-quality candidates in large discrete spaces (e.g., proteins or molecules). It argues that existing entropy-regularized reinforcement learning methods tend to produce overly diverse, suboptimal candidates, and it introduces a new algorithm, trajectory general mellowmax (TGM), which the authors report outperforms baselines on synthetic and real-world tasks.

A major bottleneck in scientific discovery is narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods use reinforcement learning (RL) guided by a proxy reward function to perform this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Second, we offer a novel, robust RL perspective on this filtering process. The regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy's evaluation). Our analysis leads us to a novel, easy-to-use algorithm we name trajectory general mellowmax (TGM): we show it identifies higher-quality, diverse candidates than baselines in both synthetic and real-world tasks. Code: https://github.com/marcojira/tgm.
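To make the "overly diverse in large search spaces" claim concrete, here is a minimal toy sketch in Python. It is not the paper's TGM algorithm or its unified operator. It assumes a single-step setting with one good candidate (reward 1) among many poor ones (reward 0), and it uses the standard textbook forms of the log-sum-exp soft maximum and the mellowmax operator. All function names and the `beta` and `omega` parameters are hypothetical choices made for illustration.

```python
import math


def logsumexp(values, beta):
    """Soft maximum: (1 / beta) * log(sum(exp(beta * v))), computed stably."""
    m = max(values)
    return m + math.log(sum(math.exp(beta * (v - m)) for v in values)) / beta


def mellowmax(values, omega):
    """Standard mellowmax: (1 / omega) * log(mean(exp(omega * v))).

    This is the textbook operator, not the paper's generalized version.
    """
    return logsumexp(values, omega) - math.log(len(values)) / omega


def mass_on_best(num_candidates, beta):
    """Probability a Boltzmann sampler (p proportional to exp(beta * reward))
    assigns to the single reward-1 candidate when all others have reward 0.
    """
    best_weight = math.exp(beta)       # weight of the one good candidate
    other_weight = num_candidates - 1  # each poor candidate has weight exp(0) = 1
    return best_weight / (best_weight + other_weight)


if __name__ == "__main__":
    # Toy illustration: at a fixed inverse temperature, the sampler's mass on
    # the best candidate shrinks as the search space grows.
    for n in (10, 1_000, 1_000_000):
        print(n, mass_on_best(n, beta=5.0))

    # Mellowmax lies between the mean and the max and approaches the max as
    # omega grows, i.e. it becomes "peakier".
    rewards = [1.0, 2.0, 3.0]
    for omega in (0.1, 1.0, 100.0):
        print(omega, mellowmax(rewards, omega))
```

In this toy setting, the mass on the best candidate falls as the number of poor candidates grows unless `beta` is raised accordingly, which is one simple way to read the abstract's motivation for targeting peakier sampling distributions.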
