94.0DLJun 3
MIRAI: Prediction and Generation of High-Impact Academic ResearchAlex Li, Joseph Jacobson
The rapid pace of scientific publishing has made the identification and synthesis of high-impact work an increasingly urgent challenge. We introduce MIRAI (Multi-year Inference of Research trends and Academic Impact), a deep learning framework that predicts paper impact using only it's title, abstract, and publication date. We train MIRAI on the arXiv academic graph to predict 5-year PageRank and citation counts, achieving Spearman's $ρ$ of 0.4686 on PageRank prediction and 0.6192 on citation prediction for papers published in 2021. We propose a research ideation pipeline built on top of MIRAI that produces research ideas oriented towards high impact. These ideas were judged as more impactful than a baseline without MIRAI by an unbiased LLM judge at a 4:3 ratio. We make the 5-year citation prediction model publicly available at https://predict-paper-impact.vercel.app.
LGOct 4, 2023
Ophiuchus: Scalable Modeling of Protein Structures through Hierarchical Coarse-graining SO(3)-Equivariant AutoencodersAllan dos Santos Costa, Ilan Mitnikov, Mario Geiger et al.
Three-dimensional native states of natural proteins display recurring and hierarchical patterns. Yet, traditional graph-based modeling of protein structures is often limited to operate within a single fine-grained resolution, and lacks hourglass neural architectures to learn those high-level building blocks. We narrow this gap by introducing Ophiuchus, an SO(3)-equivariant coarse-graining model that efficiently operates on all-atom protein structures. Our model departs from current approaches that employ graph modeling, instead focusing on local convolutional coarsening to model sequence-motif interactions with efficient time complexity in protein length. We measure the reconstruction capabilities of Ophiuchus across different compression rates, and compare it to existing models. We examine the learned latent space and demonstrate its utility through conformational interpolation. Finally, we leverage denoising diffusion probabilistic models (DDPM) in the latent space to efficiently sample protein structures. Our experiments demonstrate Ophiuchus to be a scalable basis for efficient protein modeling and generation.
LGOct 12, 2024
EquiJump: Protein Dynamics Simulation via SO(3)-Equivariant Stochastic InterpolantsAllan dos Santos Costa, Ilan Mitnikov, Franco Pellegrini et al.
Mapping the conformational dynamics of proteins is crucial for elucidating their functional mechanisms. While Molecular Dynamics (MD) simulation enables detailed time evolution of protein motion, its computational toll hinders its use in practice. To address this challenge, multiple deep learning models for reproducing and accelerating MD have been proposed drawing on transport-based generative methods. However, existing work focuses on generation through transport of samples from prior distributions, that can often be distant from the data manifold. The recently proposed framework of stochastic interpolants, instead, enables transport between arbitrary distribution endpoints. Building upon this work, we introduce EquiJump, a transferable SO(3)-equivariant model that bridges all-atom protein dynamics simulation time steps directly. Our approach unifies diverse sampling methods and is benchmarked against existing models on trajectory data of fast folding proteins. EquiJump achieves state-of-the-art results on dynamics simulation with a transferable model on all of the fast folding proteins.
BMSep 16, 2025
Accelerating Protein Molecular Dynamics Simulation with DeepJumpAllan dos Santos Costa, Manvitha Ponnapati, Dana Rubin et al.
Unraveling the dynamical motions of biomolecules is essential for bridging their structure and function, yet it remains a major computational challenge. Molecular dynamics (MD) simulation provides a detailed depiction of biomolecular motion, but its high-resolution temporal evolution comes at significant computational cost, limiting its applicability to timescales of biological relevance. Deep learning approaches have emerged as promising solutions to overcome these computational limitations by learning to predict long-timescale dynamics. However, generalizable kinetics models for proteins remain largely unexplored, and the fundamental limits of achievable acceleration while preserving dynamical accuracy are poorly understood. In this work, we fill this gap with DeepJump, an Euclidean-Equivariant Flow Matching-based model for predicting protein conformational dynamics across multiple temporal scales. We train DeepJump on trajectories of the diverse proteins of mdCATH, systematically studying our model's performance in generalizing to long-term dynamics of fast-folding proteins and characterizing the trade-off between computational acceleration and prediction accuracy. We demonstrate the application of DeepJump to ab initio folding, showcasing prediction of folding pathways and native states. Our results demonstrate that DeepJump achieves significant $\approx$1000$\times$ computational acceleration while effectively recovering long-timescale dynamics, providing a stepping stone for enabling routine simulation of proteins.
BMMar 3, 2025
RiboGen: RNA Sequence and Structure Co-Generation with Equivariant MultiFlowDana Rubin, Allan dos Santos Costa, Manvitha Ponnapati et al.
Ribonucleic acid (RNA) plays fundamental roles in biological systems, from carrying genetic information to performing enzymatic function. Understanding and designing RNA can enable novel therapeutic application and biotechnological innovation. To enhance RNA design, in this paper we introduce RiboGen, the first deep learning model to simultaneously generate RNA sequence and all-atom 3D structure. RiboGen leverages the standard Flow Matching with Discrete Flow Matching in a multimodal data representation. RiboGen is based on Euclidean Equivariant neural networks for efficiently processing and learning three-dimensional geometry. Our experiments show that RiboGen can efficiently generate chemically plausible and self-consistent RNA samples, suggesting that co-generation of sequence and structure is a competitive approach for modeling RNA.
LGJul 16, 2020
OccamNet: A Fast Neural Model for Symbolic Regression at ScaleOwen Dugan, Rumen Dangovski, Allan Costa et al.
Neural networks' expressiveness comes at the cost of complex, black-box models that often extrapolate poorly beyond the domain of the training dataset, conflicting with the goal of finding compact analytic expressions to describe scientific data. We introduce OccamNet, a neural network model that finds interpretable, compact, and sparse symbolic fits to data, à la Occam's razor. Our model defines a probability distribution over functions with efficient sampling and function evaluation. We train by sampling functions and biasing the probability mass toward better fitting solutions, backpropagating using cross-entropy matching in a reinforcement-learning loss. OccamNet can identify symbolic fits for a variety of problems, including analytic and non-analytic functions, implicit functions, and simple image classification, and can outperform state-of-the-art symbolic regression methods on real-world regression datasets. Our method requires a minimal memory footprint, fits complicated functions in minutes on a single CPU, and scales on a GPU.