LGBM | May 9, 2022

Multi-segment preserving sampling for deep manifold sampler

arXiv:2205.04259v1 | 5 citations | h-index: 28
Originality: Incremental advance
AI Analysis

This work addresses region-constrained sampling in protein sequence generation, a problem relevant to researchers in computational biology. It is an incremental advance because it builds on the existing deep manifold sampler.

The paper tackles the challenge of incorporating domain-specific knowledge into deep generative models for biological sequences. It introduces multi-segment preserving sampling, which restricts variation to designated regions, such as CDR3 in antibody design. The generated sequences keep the preserved regions intact, and GPT-2 log probability scores indicate that the sampled designs are reasonable.

Deep generative modeling for biological sequences presents a unique challenge in reconciling the bias-variance trade-off between explicit biological insight and model flexibility. The deep manifold sampler was recently proposed as a means to iteratively sample variable-length protein sequences by exploiting the gradients from a function predictor. We introduce an alternative approach to this guided sampling procedure, multi-segment preserving sampling, that enables the direct inclusion of domain-specific knowledge by designating preserved and non-preserved segments along the input sequence, thereby restricting variation to only select regions. We present its effectiveness in the context of antibody design by training two models: a deep manifold sampler and a GPT-2 language model on nearly six million heavy chain sequences annotated with the IGHV1-18 gene. During sampling, we restrict variation to only the complementarity-determining region 3 (CDR3) of the input. We obtain log probability scores from a GPT-2 model for each sampled CDR3 and demonstrate that multi-segment preserving sampling generates reasonable designs while maintaining the desired, preserved regions.
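The core idea of designating preserved and non-preserved segments can be illustrated with a minimal Python sketch. This is not the authors' implementation: the deep manifold sampler proposes edits using gradients from a function predictor, whereas the hypothetical `random_edit` below is a random stand-in that applies one substitution, insertion, or deletion. The names `split_segments`, `random_edit`, and `sample_preserving`, the half-open `(start, end)` span convention, and the toy input sequence are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def split_segments(seq, variable):
    """Split seq into (text, is_variable) pieces.

    `variable` holds half-open (start, end) spans marking the non-preserved
    regions (an assumed convention, not taken from the paper).
    """
    pieces, cursor = [], 0
    for start, end in sorted(variable):
        if start < cursor or end > len(seq) or start >= end:
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        if start > cursor:
            pieces.append((seq[cursor:start], False))  # preserved stretch
        pieces.append((seq[start:end], True))          # variable stretch
        cursor = end
    if cursor < len(seq):
        pieces.append((seq[cursor:], False))           # preserved tail
    return pieces


def random_edit(segment, rng):
    """Hypothetical proposal: one random substitution, insertion, or deletion.

    Stands in for the gradient-guided proposals of the deep manifold sampler.
    """
    op = rng.choice(["sub", "ins", "del"])
    i = rng.randrange(len(segment))
    aa = rng.choice(AMINO_ACIDS)
    if op == "sub":
        return segment[:i] + aa + segment[i + 1:]
    if op == "ins":
        return segment[:i] + aa + segment[i:]
    # Deletion; never shrink a variable segment below one residue.
    return segment if len(segment) == 1 else segment[:i] + segment[i + 1:]


def sample_preserving(seq, variable, rng, propose=random_edit):
    """Edit only the variable segments; copy preserved segments verbatim."""
    return "".join(
        propose(text, rng) if is_variable else text
        for text, is_variable in split_segments(seq, variable)
    )


# Toy example: only positions 4..7 (the "CCCC" stretch) are allowed to vary.
rng = random.Random(0)
print(sample_preserving("AAAACCCCGGGG", [(4, 8)], rng))
```

Because each variable segment is edited independently and then re-joined with the untouched segments, the output length may change while the preserved regions stay identical to the input. In the antibody setting described above, the single variable span would cover CDR3, and each sampled CDR3 could then be scored separately by a language model.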

