Shijing Sun

MTRL-SCI
h-index: 26
6 papers
435 citations
Novelty: 29%
AI Score: 32

6 Papers

CHEM-PH · Jul 28, 2023
Lessons in Reproducibility: Insights from NLP Studies in Materials Science

Xiangyun Lei, Edward Kim, Viktoriia Baibakova et al.

Natural Language Processing (NLP), a cornerstone field within artificial intelligence, has been increasingly applied to the materials science literature. Our study conducts a reproducibility analysis of two pioneering works within this domain: "Machine-learned and codified synthesis parameters of oxide materials" by Kim et al., and "Unsupervised word embeddings capture latent knowledge from materials science literature" by Tshitoyan et al. We aim to comprehend these studies from a reproducibility perspective, acknowledging their significant influence on the field of materials informatics, rather than critiquing them. Our study indicates that both papers offered thorough workflows, tidy and well-documented codebases, and clear guidance for model evaluation. This makes it easier to replicate their results successfully and partially reproduce their findings. In doing so, they set commendable standards for future materials science publications to aspire to. However, our analysis also highlights areas for improvement, such as providing access to training data where copyright restrictions permit, offering more transparency about model architecture and the training process, and specifying software dependency versions. We also cross-compare the word embedding models between papers, and find that some key differences in reproducibility and cross-compatibility are attributable to design choices outside the bounds of the models themselves. In summary, our study appreciates the benchmark set by these seminal papers while advocating for further enhancements in research reproducibility practices in the field of NLP for materials science. This balance of understanding and continuous improvement will ultimately propel the intersecting domains of NLP and materials science literature into a future of exciting discoveries.

MTRL-SCI · Aug 15, 2025
The Rise of Generative AI for Metal-Organic Framework Design and Synthesis

Chenru Duan, Aditya Nandy, Shyam Chand Pal et al.

Advances in generative artificial intelligence are transforming how metal-organic frameworks (MOFs) are designed and discovered. This Perspective introduces the shift from laborious enumeration of MOF candidates to generative approaches that can autonomously propose new porous reticular structures on demand and synthesize them in the laboratory. We outline the progress of employing deep learning models, such as variational autoencoders, diffusion models, and large language model-based agents, that are fueled by the growing amount of available data from the MOF community and suggest novel crystalline materials designs. These generative tools can be combined with high-throughput computational screening and even automated experiments to form accelerated, closed-loop discovery pipelines. The result is a new paradigm for reticular chemistry in which AI algorithms more efficiently direct the search for high-performance MOF materials for clean air and energy applications. Finally, we highlight remaining challenges such as synthetic feasibility, dataset diversity, and the need for further integration of domain knowledge.

MTRL-SCI · Oct 8, 2021
Opportunities for Machine Learning to Accelerate Halide Perovskite Commercialization and Scale-Up

Rishi E. Kumar, Armi Tiihonen, Shijing Sun et al.

While halide perovskites attract significant academic attention, examples of at-scale industrial production are still sparse. In this perspective, we review practical challenges hindering the commercialization of halide perovskites, and discuss how machine-learning (ML) tools could help: (1) active-learning algorithms that blend institutional knowledge and human expertise could help stabilize and rapidly update baseline manufacturing processes; (2) ML-powered metrology, including computer imaging, could help narrow the performance gap between large- and small-area devices; and (3) inference methods could help accelerate root-cause analysis by reconciling multiple data streams and simulations, focusing research effort on areas with highest probability for improvement. We conclude that to address many of these challenges, incremental -- not radical -- adaptations of existing ML and statistical methods are needed. We identify resources to help develop in-house data-science talent, and propose how industry-academic partnerships could help adapt "ready-now" ML tools to specific industry needs, further improve process control by revealing underlying mechanisms, and develop "gamechanger" discovery-oriented algorithms to better navigate vast materials combination spaces and the literature.

MTRL-SCI · May 23, 2021
Benchmarking the Performance of Bayesian Optimization across Multiple Experimental Materials Science Domains

Qiaohao Liang, Aldair E. Gongora, Zekun Ren et al.

In the field of machine learning (ML) for materials optimization, active learning algorithms, such as Bayesian Optimization (BO), have been leveraged for guiding autonomous and high-throughput experimentation systems. However, very few studies have evaluated the efficiency of BO as a general optimization algorithm across a broad range of experimental materials science domains. In this work, we evaluate the performance of BO algorithms with a collection of surrogate model and acquisition function pairs across five diverse experimental materials systems, namely carbon nanotube polymer blends, silver nanoparticles, lead-halide perovskites, as well as additively manufactured polymer structures and shapes. By defining acceleration and enhancement metrics for general materials optimization objectives, we find that for surrogate model selection, Gaussian Process (GP) with anisotropic kernels (automatic relevance determination, ARD) and Random Forests (RF) have comparable performance and both outperform the commonly used GP without ARD. We discuss the implicit distributional assumptions of RF and GP, and the benefits of using GP with anisotropic kernels in detail. We provide practical insights for experimentalists on surrogate model selection of BO during materials optimization campaigns.
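
The distinction between a GP kernel with and without ARD can be illustrated with a minimal sketch. It is not the paper's code: the squared-exponential kernel form, the function name `rbf_kernel`, the two input points, and all length-scale values are assumptions chosen for illustration. An isotropic kernel shares one length scale across all input variables, while an anisotropic (ARD) kernel assigns one per variable, so a variable with little influence on the objective can be effectively down-weighted.

```python
import math

def rbf_kernel(x, y, length_scales):
    """Squared-exponential kernel with one length scale per input dimension."""
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, length_scales))
    return math.exp(-0.5 * sq)

# Two hypothetical process settings that differ only in the second variable.
x1 = (0.2, 0.0)
x2 = (0.2, 1.0)

# Isotropic kernel: the same (assumed) length scale for both variables.
isotropic = rbf_kernel(x1, x2, (0.5, 0.5))

# Anisotropic (ARD-style) kernel: a long length scale on the second variable
# means differences along it barely reduce the similarity.
anisotropic = rbf_kernel(x1, x2, (0.5, 50.0))

print(f"isotropic similarity:   {isotropic:.4f}")
print(f"anisotropic similarity: {anisotropic:.4f}")
```

In practice the per-variable length scales would be fitted to data rather than set by hand; the hand-picked values here only show how the two kernel families can disagree about the same pair of points.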

COMP-PH · May 15, 2020
An invertible crystallographic representation for general inverse design of inorganic crystals with targeted properties

Zekun Ren, Siyu Isaac Parker Tian, Juhwan Noh et al.

Realizing general inverse design could greatly accelerate the discovery of new materials with user-defined properties. However, state-of-the-art generative models tend to be limited to a specific composition or crystal structure. Herein, we present a framework capable of general inverse design (not limited to a given set of elements or crystal structures), featuring a generalized invertible representation that encodes crystals in both real and reciprocal space, and a property-structured latent space from a variational autoencoder (VAE). In three design cases, the framework generates 142 new crystals with user-defined formation energies, bandgap, thermoelectric (TE) power factor, and combinations thereof. These generated crystals, absent in the training database, are validated by first-principles calculations. The success rates (number of first-principles-validated target-satisfying crystals/number of designed crystals) range between 7.1% and 38.9%. These results represent a significant step toward property-driven general inverse design using generative models, although practical challenges remain when coupled with experimental synthesis.

DATA-AN · Nov 20, 2018
Fast and interpretable classification of small X-ray diffraction datasets using data augmentation and deep neural networks

Felipe Oviedo, Zekun Ren, Shijing Sun et al.

X-ray diffraction (XRD) data acquisition and analysis are among the most time-consuming steps in the development cycle of novel thin-film materials. We propose a machine-learning-enabled approach to predict crystallographic dimensionality and space group from a limited number of thin-film XRD patterns. We overcome the scarce-data problem intrinsic to novel materials development by coupling a supervised machine learning approach with a model-agnostic, physics-informed data augmentation strategy using simulated data from the Inorganic Crystal Structure Database (ICSD) and experimental data. As a test case, 115 thin-film metal halides spanning 3 dimensionalities and 7 space groups are synthesized and classified. After testing various algorithms, we develop and implement an all-convolutional neural network, with cross-validated accuracies for dimensionality and space-group classification of 93% and 89%, respectively. We propose average class activation maps, computed from a global average pooling layer, to allow high model interpretability by human experimentalists, elucidating the root causes of misclassification. Finally, we systematically evaluate the maximum XRD pattern step size (data acquisition rate) before loss of predictive accuracy occurs, and determine it to be 0.16°, which enables an XRD pattern to be obtained and classified in 5.5 minutes or less.
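
The idea of augmenting a small set of diffraction patterns can be sketched as follows. This is an illustrative assumption, not the authors' procedure: the abstract does not specify the augmentation steps, so the two transformations used here (a small shift along the 2θ axis and window-wise intensity rescaling), the function name `augment_pattern`, and all parameter values are hypothetical, as is the toy pattern.

```python
import random

def augment_pattern(intensities, max_shift=2, window=4,
                    scale_range=(0.5, 1.5), rng=None):
    """Return a shifted, locally rescaled copy of a 1-D diffraction pattern."""
    if rng is None:
        rng = random.Random()
    n = len(intensities)
    # Shift along the 2-theta axis by a whole number of steps, zero-padding edges.
    shift = rng.randint(-max_shift, max_shift)
    shifted = [intensities[i - shift] if 0 <= i - shift < n else 0.0
               for i in range(n)]
    # Rescale each window independently to vary relative peak heights.
    out = []
    for start in range(0, n, window):
        factor = rng.uniform(*scale_range)
        out.extend(v * factor for v in shifted[start:start + window])
    return out

# Toy pattern with two peaks (hypothetical intensities on a uniform 2-theta grid).
pattern = [0.0, 0.1, 1.0, 0.1, 0.0, 0.0, 0.3, 0.6, 0.3, 0.0, 0.0, 0.0]

# Generate several augmented variants with fixed seeds for repeatability.
augmented = [augment_pattern(pattern, rng=random.Random(seed)) for seed in range(5)]
```

Each augmented variant keeps the length of the original pattern, so it could be fed to the same classifier as the measured data.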