LineageFlow: Flow Matching for High-Fidelity Family-Aware Protein Sequence Generation
LineageFlow gives protein engineers a way to generate diverse sequences that remain recognizable family members and are more plausible, addressing the weak family control of discrete generative models.
LineageFlow introduces a Dirichlet flow-matching model for protein sequence generation that initializes from lineage priors derived from ancestral sequence reconstruction, achieving family validity close to natural sequences and improved structural confidence over baselines while maintaining novelty and diversity.
Protein sequence generation for engineering requires samples that are biophysically plausible and, when targeting a family or domain, remain recognizable members while exploring within-family diversity. Current discrete generative models typically start from uniform or masked-token noise, which discards the strong position-specific constraints induced by evolution and forces the model to reconstruct conserved residues from scratch, resulting in weak family control and low plausibility. We propose \emph{LineageFlow}, a Dirichlet flow-matching model that initializes generation from lineage priors derived from ancestral sequence reconstruction, turning generation into structured mutation from an evolved scaffold. Across diverse protein families, LineageFlow achieves family validity close to that of held-out natural sequences and improves predicted structural confidence over uniform- and mask-initialized baselines while maintaining substantial novelty and diversity. Finally, we introduce \emph{rerouting}, a single intermediate-time mutate--select--amplify intervention that enables objective-guided sampling without per-step predictor guidance and yields further gains in plausibility, including in a zero-shot enzyme generation case study. Code is available at https://github.com/Jinx-byebye/LineageFlow.
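To make the contrast between noise initialization and lineage-prior initialization concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes the lineage prior is available as a per-position matrix of residue probabilities (for example, posteriors from an ancestral sequence reconstruction tool). The function names, the `concentration` parameter, and the `eps` smoothing term are hypothetical choices made for illustration only.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def sample_uniform_init(length, rng):
    """Baseline start point: a flat Dirichlet over residues at every position.

    Returns an array of shape (length, 20) whose rows lie on the simplex.
    """
    return rng.dirichlet(np.ones(len(AMINO_ACIDS)), size=length)


def sample_lineage_init(prior, concentration, rng, eps=1e-2):
    """Hypothetical lineage-prior start point.

    prior: (length, 20) array of per-position residue probabilities,
           assumed here to come from ancestral sequence reconstruction.
    concentration: larger values keep samples closer to the prior.
    eps: small constant that keeps every Dirichlet parameter positive.
    """
    alpha = concentration * prior + eps
    # One Dirichlet draw per position, each centred near its prior row.
    return np.stack([rng.dirichlet(a) for a in alpha])
```

Under these assumptions, a uniform start carries no information about which residues are conserved, whereas a start drawn around the prior already places most of its mass on the reconstructed residue at each position, so the flow only has to model departures from that scaffold.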