Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics
This work addresses data-generation bottlenecks for researchers in fields such as population genetics and bioinformatics, though the contribution appears incremental: it applies a novel training algorithm to the established Restricted Boltzmann Machine.
The paper tackles the challenge of inefficient Markov chain Monte Carlo mixing in energy-based models for generating high-quality, label-specific structured data, achieving improved classification and generation in only a few sampling steps across datasets like handwritten digits and human genome mutations.
In this study, we address the challenge of using energy-based models to produce high-quality, label-specific data in complex structured datasets, such as population genetics, RNA, or protein sequence data. Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing, which reduces the diversity of synthetic data and increases generation times. To address these issues, we use a novel training algorithm that exploits non-equilibrium effects. This approach, applied to the Restricted Boltzmann Machine, improves the model's ability to correctly classify samples and generate high-quality synthetic data in only a few sampling steps. The effectiveness of this method is demonstrated by its successful application to four different types of data: handwritten digits, mutations of human genomes classified by continental origin, functionally characterized sequences of an enzyme protein family, and homologous RNA sequences from specific taxonomies.
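The abstract does not spell out the training algorithm, so the following is only a minimal sketch of what label-specific generation "in only a few sampling steps" could look like for a Restricted Boltzmann Machine. It assumes binary visible and hidden units, one-hot label units coupled to the hidden layer, and chains started from random noise. The function name, the parameter names (W, D, b, c), and the random untrained weights are all hypothetical placeholders and are not taken from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_with_clamped_label(W, D, b, c, label, n_steps, n_chains, rng):
    """Run a few block Gibbs steps with the label held fixed.

    W: (n_visible, n_hidden) visible-hidden couplings (hypothetical names)
    D: (n_labels, n_hidden)  label-hidden couplings
    b: (n_visible,) visible biases, c: (n_hidden,) hidden biases
    """
    n_visible = W.shape[0]
    n_labels = D.shape[0]

    # One-hot encoding of the requested class; it stays clamped throughout.
    one_hot = np.zeros(n_labels)
    one_hot[label] = 1.0

    # Assumption: chains start from uniform random binary configurations.
    v = rng.integers(0, 2, size=(n_chains, n_visible)).astype(float)

    for _ in range(n_steps):
        # Sample hidden units given the visible units and the clamped label.
        p_h = sigmoid(v @ W + one_hot @ D + c)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # Sample visible units given the hidden units.
        p_v = sigmoid(h @ W.T + b)
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v


# Toy usage with random, untrained parameters (illustration only).
rng = np.random.default_rng(0)
n_visible, n_hidden, n_labels = 20, 8, 3
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
D = 0.01 * rng.standard_normal((n_labels, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

samples = sample_with_clamped_label(
    W, D, b, c, label=1, n_steps=10, n_chains=5, rng=rng
)
print(samples.shape)
```

With trained parameters, the number of steps would be kept small and fixed. In this sketch it is simply an argument, and the output carries no meaning because the weights are random.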