BMJun 20, 2023
Top-down machine learning of coarse-grained protein force-fieldsCarles Navarro, Maciej Majewski, Gianni de Fabritiis
Developing accurate and efficient coarse-grained representations of proteins is crucial for understanding their folding, function, and interactions over extended timescales. Our methodology involves simulating proteins with molecular dynamics and utilizing the resulting trajectories to train a neural network potential through differentiable trajectory reweighting. Remarkably, this method requires only the native conformation of proteins, eliminating the need for labeled data derived from extensive simulations or memory-intensive end-to-end differentiable simulations. Once trained, the model can be employed to run parallel molecular dynamics simulations and sample folding events for proteins both within and beyond the training distribution, showcasing its extrapolation capabilities. By applying Markov State Models, native-like conformations of the simulated proteins can be predicted from the coarse-grained simulations. Owing to its theoretical transferability and ability to use solely experimental static structures as training data, we anticipate that this approach will prove advantageous for developing new protein force fields and further advancing the study of protein dynamics, folding, and interactions.
BMJul 15, 2024
On Machine Learning Approaches for Protein-Ligand Binding Affinity PredictionNikolai Schapin, Carles Navarro, Albert Bou et al.
Binding affinity optimization is crucial in early-stage drug discovery. While numerous machine learning methods exist for predicting ligand potency, their comparative efficacy remains unclear. This study evaluates the performance of classical tree-based models and advanced neural networks in protein-ligand binding affinity prediction. Our comprehensive benchmarking encompasses 2D models utilizing ligand-only RDKit embeddings and Large Language Model (LLM) ligand representations, as well as 3D neural networks incorporating bound protein-ligand conformations. We assess these models across multiple standard datasets, examining various predictive scenarios including classification, ranking, regression, and active learning. Results indicate that simpler models can surpass more complex ones in specific tasks, while 3D models leveraging structural information become increasingly competitive with larger training datasets containing compounds with labelled affinity data against multiple targets. Pre-trained 3D models, by incorporating protein pocket environments, demonstrate significant advantages in data-scarce scenarios for specific binding pockets. Additionally, LLM pretraining on 2D ligand data enhances complex model performance, providing versatile embeddings that outperform traditional RDKit features in computational efficiency. Finally, we show that combining 2D and 3D model strengths improves active learning outcomes beyond current state-of-the-art approaches. These findings offer valuable insights for optimizing machine learning strategies in drug discovery pipelines.
LGApr 21
Structure-guided molecular design with contrastive 3D protein-ligand learningCarles Navarro, Philipp Tholke, Gianni de Fabritiis
Structure-based drug discovery faces the dual challenge of accurately capturing 3D protein-ligand interactions while navigating ultra-large chemical spaces to identify synthetically accessible candidates. In this work, we present a unified framework that addresses these challenges by combining contrastive 3D structure encoding with autoregressive molecular generation conditioned on commercial compound spaces. First, we introduce an SE(3)-equivariant transformer that encodes ligand and pocket structures into a shared embedding space via contrastive learning, achieving competitive results in zero-shot virtual screening. Second, we integrate these embeddings into a multimodal Chemical Language Model (MCLM). The model generates target-specific molecules conditioned on either pocket or ligand structures, with a learned dataset token that steers the output toward targeted chemical spaces, yielding candidates with favorable predicted binding properties across diverse targets.
BMOct 1, 2025
Speak to a Protein: An Interactive Multimodal Co-Scientist for Protein AnalysisCarles Navarro, Mariona Torrens, Philipp Thölke et al.
Building a working mental model of a protein typically requires weeks of reading, cross-referencing crystal and predicted structures, and inspecting ligand complexes, an effort that is slow, unevenly accessible, and often requires specialized computational skills. We introduce \emph{Speak to a Protein}, a new capability that turns protein analysis into an interactive, multimodal dialogue with an expert co-scientist. The AI system retrieves and synthesizes relevant literature, structures, and ligand data; grounds answers in a live 3D scene; and can highlight, annotate, manipulate and see the visualization. It also generates and runs code when needed, explaining results in both text and graphics. We demonstrate these capabilities on relevant proteins, posing questions about binding pockets, conformational changes, or structure-activity relationships to test ideas in real-time. \emph{Speak to a Protein} reduces the time from question to evidence, lowers the barrier to advanced structural analysis, and enables hypothesis generation by tightly coupling language, code, and 3D structures. \emph{Speak to a Protein} is freely accessible at https://open.playmolecule.org.