GN | AI | LG | Feb 8, 2024

DiscDiff: Latent Diffusion Model for DNA Sequence Generation

arXiv:2402.06079v2 | 21 citations | h-index: 37
AI Analysis

This work addresses DNA sequence generation for applications such as gene therapy and protein production, proposing a new method for a known bottleneck in generative modeling.

The paper tackles DNA sequence generation by introducing DiscDiff, a latent diffusion model tailored for discrete sequences, and Absorb-Escape, a post-training algorithm that corrects the rounding errors arising when converting between latent and input spaces. Together they outperform existing diffusion models in generating both short and long sequences. The paper also introduces EPD-GenDNA, a multi-species dataset of 160,000 unique sequences from 15 species.

This paper introduces a novel framework for DNA sequence generation, comprising two key components: DiscDiff, a Latent Diffusion Model (LDM) tailored for generating discrete DNA sequences, and Absorb-Escape, a post-training algorithm designed to refine these sequences. Absorb-Escape enhances the realism of the generated sequences by correcting 'round errors' inherent in the conversion process between latent and input spaces. Our approach not only sets new standards in DNA sequence generation but also demonstrates superior performance over existing diffusion models in generating both short and long DNA sequences. Additionally, we introduce EPD-GenDNA, the first comprehensive, multi-species dataset for DNA generation, encompassing 160,000 unique sequences from 15 species. We hope this study will advance the generative modelling of DNA, with potential implications for gene therapy and protein production.

