Justin Diamond

LG
4papers
6citations
Novelty49%
AI Score34

4 Papers

LGJul 10, 2023
Geometric Constraints in Probabilistic Manifolds: A Bridge from Molecular Dynamics to Structured Diffusion Processes

Justin Diamond, Markus Lill

Understanding the macroscopic characteristics of biological complexes demands precision and specificity in statistical ensemble modeling. One of the primary challenges in this domain lies in sampling from particular subsets of the state-space, driven either by existing structural knowledge or specific areas of interest within the state-space. We propose a method that enables sampling from distributions that rigorously adhere to arbitrary sets of geometric constraints in Euclidean spaces. This is achieved by integrating a constraint projection operator within the well-regarded architecture of Denoising Diffusion Probabilistic Models, a framework founded in generative modeling and probabilistic inference. The significance of this work becomes apparent, for instance, in the context of deep learning-based drug design, where it is imperative to maintain specific molecular profile interactions to realize the desired therapeutic outcomes and guarantee safety.

CLMar 31, 2023
"Genlangs" and Zipf's Law: Do languages generated by ChatGPT statistically look human?

Justin Diamond

OpenAI's GPT-4 is a Large Language Model (LLM) that can generate coherent constructed languages, or "conlangs," which we propose be called "genlangs" when generated by Artificial Intelligence (AI). The genlangs created by ChatGPT for this research (Voxphera, Vivenzia, and Lumivoxa) each have unique features, appear facially coherent, and plausibly "translate" into English. This study investigates whether genlangs created by ChatGPT follow Zipf's law. Zipf's law approximately holds across all natural and artificially constructed human languages. According to Zipf's law, the word frequencies in a text corpus are inversely proportional to their rank in the frequency table. This means that the most frequent word appears about twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. We hypothesize that Zipf's law will hold for genlangs because (1) genlangs created by ChatGPT fundamentally operate in the same way as human language with respect to the semantic usefulness of certain tokens, and (2) ChatGPT has been trained on a corpora of text that includes many different languages, all of which exhibit Zipf's law to varying degrees. Through statistical linguistics, we aim to understand if LLM-based languages statistically look human. Our findings indicate that genlangs adhere closely to Zipf's law, supporting the hypothesis that genlangs created by ChatGPT exhibit similar statistical properties to natural and artificial human languages. We also conclude that with human assistance, AI is already capable of creating the world's first fully-functional genlang, and we call for its development.

LGNov 21, 2025
Diffusion Models are Molecular Dynamics Simulators

Justin Diamond, Markus Lill

We prove that a denoising diffusion sampler equipped with a sequential bias across the batch dimension is exactly an Euler-Maruyama integrator for overdamped Langevin dynamics. Each reverse denoising step, with its associated spring stiffness, can be interpreted as one step of a stochastic differential equation with an effective time step set jointly by the noise schedule and that stiffness. The learned score then plays the role of the drift, equivalently the gradient of a learned energy, yielding a precise correspondence between diffusion sampling and Langevin time evolution. This equivalence recasts molecular dynamics (MD) in terms of diffusion models. Accuracy is no longer tied to a fixed, extremely small MD time step; instead, it is controlled by two scalable knobs: model capacity, which governs how well the drift is approximated, and the number of denoising steps, which sets the integrator resolution. In practice, this leads to a fully data-driven MD framework that learns forces from uncorrelated equilibrium snapshots, requires no hand-engineered force fields, uses no trajectory data for training, and still preserves the Boltzmann distribution associated with the learned energy. We derive trajectory-level, information-theoretic error bounds that cleanly separate discretization error from score-model error, clarify how temperature enters through the effective spring, and show that the resulting sampler generates molecular trajectories with MD-like temporal correlations, even though the model is trained only on static configurations.

BMMay 31, 2023
Improving Protein-peptide Interface Predictions in the Low Data Regime

Justin Diamond, Markus Lill

We propose a novel approach for predicting protein-peptide interactions using a bi-modal transformer architecture that learns an inter-facial joint distribution of residual contacts. The current data sets for crystallized protein-peptide complexes are limited, making it difficult to accurately predict interactions between proteins and peptides. To address this issue, we propose augmenting the existing data from PepBDB with pseudo protein-peptide complexes derived from the PDB. The augmented data set acts as a method to transfer physics-based contextdependent intra-residue (within a domain) interactions to the inter-residual (between) domains. We show that the distributions of inter-facial residue-residue interactions share overlap with inter residue-residue interactions, enough to increase predictive power of our bi-modal transformer architecture. In addition, this dataaugmentation allows us to leverage the vast amount of protein-only data available in the PDB to train neural networks, in contrast to template-based modeling that acts as a prior