Janghoon Ock

CL
h-index43
9papers
54citations
Novelty48%
AI Score49

9 Papers

CLMar 1
Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent

Achuth Chandrasekhar, Janghoon Ock, Amir Barati Farimani

The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods for this include time-consuming and expensive experimental trial-and-error approaches in labs based on chemical theory or heavily computational first-principles approaches based on density functional theory. Recent studies show that deep learning models like graph neural networks (GNNs) can significantly speed up the screening and discovery of catalyst materials by many orders of magnitude, with very high accuracy and fidelity. In this work, we introduce Catalyst-Agent, a Model Context Protocol (MCP) server-based, LLM-powered AI agent. It can explore vast material databases using the OPTIMADE API, make structural modifications, calculate adsorption energies using Meta FAIRchem's UMA (GNN) model via FAIRchem's AdsorbML workflow and slab construction, and make useful material suggestions to the researcher in a closed-loop manner, including surface-level modifications to refine near-miss candidates. It is tested on three pivotal reactions: the oxygen reduction reaction (ORR), the nitrogen reduction reaction (NRR), and the CO2 reduction reaction (CO2RR). Catalyst-Agent achieves a success rate of 23-34 percent among all the materials it chooses and evaluates, and manages to converge in 1-2 trials per successful material on average. This work demonstrates the potential of AI agents to exercise their planning capabilities and tool use to operationalize the catalyst screening workflow, provide useful, testable hypotheses, and accelerate future scientific discoveries for humanity with minimal human intervention.

LGNov 13, 2024
UniMat: Unifying Materials Embeddings through Multi-modal Learning

Janghoon Ock, Joseph Montoya, Daniel Schweigert et al.

Materials science datasets are inherently heterogeneous and are available in different modalities such as characterization spectra, atomic structures, microscopic images, and text-based synthesis conditions. The advancements in multi-modal learning, particularly in vision and language models, have opened new avenues for integrating data in different forms. In this work, we evaluate common techniques in multi-modal learning (alignment and fusion) in unifying some of the most important modalities in materials science: atomic structure, X-ray diffraction patterns (XRD), and composition. We show that structure graph modality can be enhanced by aligning with XRD patterns. Additionally, we show that aligning and fusing more experimentally accessible data formats, such as XRD patterns and compositions, can create more robust joint embeddings than individual modalities across various tasks. This lays the groundwork for future studies aiming to exploit the full potential of multi-modal data in materials science, facilitating more informed decision-making in materials design and discovery.

CLOct 22, 2024
Adsorb-Agent: Autonomous Identification of Stable Adsorption Configurations via Large Language Model Agent

Janghoon Ock, Radheesh Sharma Meda, Tirtha Vinchurkar et al.

Adsorption energy is a key reactivity descriptor in catalysis. Determining adsorption energy requires evaluating numerous adsorbate-catalyst configurations, making it computationally intensive. Current methods rely on exhaustive sampling, which does not guarantee the identification of the global minimum energy. To address this, we introduce Adsorb-Agent, a Large Language Model (LLM) agent designed to efficiently identify stable adsorption configurations corresponding to the global minimum energy. Adsorb-Agent leverages its built-in knowledge and reasoning to strategically explore configurations, significantly reducing the number of initial setups required while improving energy prediction accuracy. In this study, we also evaluated the performance of different LLMs, including GPT-4o, GPT-4o-mini, Claude-3.7-Sonnet, and DeepSeek-Chat, as the reasoning engine for Adsorb-Agent, with GPT-4o showing the strongest overall performance. Tested on twenty diverse systems, Adsorb-Agent identifies comparable adsorption energies for 84% of cases and achieves lower energies for 35%, particularly excelling in complex systems. It identifies lower energies in 47% of intermetallic systems and 67% of systems with large adsorbates. These findings demonstrate Adsorb-Agent's potential to accelerate catalyst discovery by reducing computational costs and enhancing prediction reliability compared to exhaustive search methods.

LGJun 26, 2025
LLM-guided Chemical Process Optimization with a Multi-Agent Approach

Tong Zeng, Srivathsan Badrinarayanan, Janghoon Ock et al.

Chemical process optimization maximizes production efficiency and economic performance, but optimization algorithms, including gradient-based solvers, numerical methods, and parameter grid searches, become impractical when operating constraints are ill-defined or unavailable. We present a multi-agent LLM framework that autonomously infers operating constraints from minimal process descriptions, then collaboratively guides optimization. Our AutoGen-based framework employs OpenAI's o3 model with specialized agents for constraint generation, parameter validation, simulation, and optimization guidance. Through autonomous constraint generation and iterative multi-agent optimization, the framework eliminates the need for predefined operational bounds. Validated on hydrodealkylation across cost, yield, and yield-to-cost ratio metrics, the framework achieved competitive performance with conventional methods while reducing wall-time 31-fold relative to grid search, converging in under 20 minutes. The reasoning-guided search demonstrates sophisticated process understanding, correctly identifying utility trade-offs and applying domain-informed heuristics. Unlike conventional methods requiring predefined constraints, our approach uniquely combines autonomous constraint generation with interpretable parameter exploration. Model comparison reveals reasoning-capable architectures (o3, o1) are essential for successful optimization, while standard models fail to converge. This approach is particularly valuable for emerging processes and retrofit applications where operational constraints are poorly characterized or unavailable.

LGJun 26, 2025
Large Language Model Agent for Modular Task Execution in Drug Discovery

Janghoon Ock, Radheesh Sharma Meda, Srivathsan Badrinarayanan et al.

We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, domain-specific question answering, molecular generation, property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. In a case study targeting BCL-2 in lymphocytic leukemia, the agent autonomously retrieved relevant biomolecular information, including FASTA sequences, SMILES representations, and literature, and answered mechanistic questions with improved contextual accuracy compared to standard LLMs. It then generated chemically diverse seed molecules and predicted 67 ADMET-related properties, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55. The number of molecules satisfying empirical drug-likeness filters also rose; for example, compliance with the Ghose filter increased from 32 to 55 within a pool of 100 molecules. The framework also employed Boltz-2 to generate 3D protein-ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.

CLFeb 27, 2025
NANOGPT: A Query-Driven Large Language Model Retrieval-Augmented Generation System for Nanotechnology Research

Achuth Chandrasekhar, Omid Barati Farimani, Olabode T. Ajenifujah et al.

This paper presents the development and application of a Large Language Model Retrieval-Augmented Generation (LLM-RAG) system tailored for nanotechnology research. The system leverages the capabilities of a sophisticated language model to serve as an intelligent research assistant, enhancing the efficiency and comprehensiveness of literature reviews in the nanotechnology domain. Central to this LLM-RAG system is its advanced query backend retrieval mechanism, which integrates data from multiple reputable sources. The system retrieves relevant literature by utilizing Google Scholar's advanced search, and scraping open-access papers from Elsevier, Springer Nature, and ACS Publications. This multifaceted approach ensures a broad and diverse collection of up-to-date scholarly articles and papers. The proposed system demonstrates significant potential in aiding researchers by providing a streamlined, accurate, and exhaustive literature retrieval process, thereby accelerating research advancements in nanotechnology. The effectiveness of the LLM-RAG system is validated through rigorous testing, illustrating its capability to significantly reduce the time and effort required for comprehensive literature reviews, while maintaining high accuracy, query relevance and outperforming standard, publicly available LLMS.

CLJan 7, 2025
Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction

Ying-Ting Yeh, Janghoon Ock, Achuth Chandrasekhar et al.

We investigate transformer-based language models, including RoBERTa, T5, Llama-3, and MatSciBERT, for predicting the band gaps of semiconductor materials directly from textual descriptions. The inputs encode key material features, such as chemical composition, crystal system, space group, and other structural and electronic properties. Unlike shallow machine learning models, which require extensive feature engineering, or Graph Neural Networks, which rely on graph representations derived from atomic coordinates, pretrained language models can process textual inputs directly, eliminating the need for manual feature preprocessing or structure-based encoding. Material descriptions were constructed in two formats: structured strings with a consistent template and natural language narratives generated via the ChatGPT API. Each model was augmented with a custom regression head and finetuned for band gap prediction task. Language models of different architectures and parameter sizes were all able to predict band gaps from human-readable text with strong accuracy, achieving MAEs in the range of 0.25-0.33 eV, highlighting the success of this approach for scientific regression tasks. Finetuned Llama-3, with 1.2 billion parameters, achieved the highest accuracy (MAE 0.248 eV, R2 0.891). MatSciBERT, pretrained on materials science literature, reached comparable performance (MAE 0.288 eV, R2 0.871) with significantly fewer parameters (110 million), emphasizing the importance of domain-specific pretraining. Attention analysis shows that both models selectively focus on compositional and spin-related features while de-emphasizing geometric features, reflecting the difficulty of capturing spatial information from text. These results establish that pretrained language models can effectively extract complex feature-property relationships from textual material descriptions.

LGOct 23, 2025
Meta-Learning for Cross-Task Generalization in Protein Mutation Property Prediction

Srivathsan Badrinarayanan, Yue Su, Janghoon Ock et al.

Protein mutations can have profound effects on biological function, making accurate prediction of property changes critical for drug discovery, protein engineering, and precision medicine. Current approaches rely on fine-tuning protein-specific transformers for individual datasets, but struggle with cross-dataset generalization due to heterogeneous experimental conditions and limited target domain data. We introduce two key innovations: (1) the first application of Model-Agnostic Meta-Learning (MAML) to protein mutation property prediction, and (2) a novel mutation encoding strategy using separator tokens to directly incorporate mutations into sequence context. We build upon transformer architectures integrating them with MAML to enable rapid adaptation to new tasks through minimal gradient steps rather than learning dataset-specific patterns. Our mutation encoding addresses the critical limitation where standard transformers treat mutation positions as unknown tokens, significantly degrading performance. Evaluation across three diverse protein mutation datasets (functional fitness, thermal stability, and solubility) demonstrates significant advantages over traditional fine-tuning. In cross-task evaluation, our meta-learning approach achieves 29% better accuracy for functional fitness with 65% less training time, and 94% better accuracy for solubility with 55% faster training. The framework maintains consistent training efficiency regardless of dataset size, making it particularly valuable for industrial applications and early-stage protein design where experimental data is limited. This work establishes a systematic application of meta-learning to protein mutation analysis and introduces an effective mutation encoding strategy, offering transformative methodology for cross-domain generalization in protein engineering.

CHEM-PHJun 17, 2025
Beyond Force Metrics: Pre-Training MLFFs for Stable MD Simulations

Shagun Maheshwari, Janghoon Ock, Adeesh Kolluru et al.

Machine-learning force fields (MLFFs) have emerged as a promising solution for speeding up ab initio molecular dynamics (MD) simulations, where accurate force predictions are critical but often computationally expensive. In this work, we employ GemNet-T, a graph neural network model, as an MLFF and investigate two training strategies: (1) direct training on MD17 (10K samples) without pre-training, and (2) pre-training on the large-scale OC20 dataset followed by fine-tuning on MD17 (10K). While both approaches achieve low force mean absolute errors (MAEs), reaching 5 meV/A per atom, we find that lower force errors do not necessarily guarantee stable MD simulations. Notably, the pre-trained GemNet-T model yields significantly improved simulation stability, sustaining trajectories up to three times longer than the model trained from scratch. These findings underscore the value of pre-training on large, diverse datasets to capture complex molecular interactions and highlight that force MAE alone is not always a sufficient metric of MD simulation stability.