CLMar 14, 2025Code
CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and ReasoningHao Cui, Zahra Shamsi, Gowoon Cheon et al.
Scientific problem-solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long-Context Understanding,Reasoning and Information Extraction benchmark to measure the potential of Large Language Models (LLMs) in scientific problem-solving and assisting scientists in realistic workflows. This benchmark introduces ten challenging tasks with a total of 580 problems and solution pairs curated by experts in six disciplines - materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins - covering both experimental and theoretical work-flows in science. We evaluate a range of closed and open LLMs on tasks in CURIE which requires domain expertise, comprehension of long in-context information,and multi-step reasoning. While Gemini Flash 2.0 and Claude-3 show consistent high comprehension across domains, the popular GPT-4o and command-R+ fail dramatically on protein sequencing tasks. With the best performance at 32% there is much room for improvement for all models. We hope that insights gained from CURIE can guide the future development of LLMs in sciences. Evaluation code and data are in https://github.com/google/curie
CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic CapabilitiesGheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
MTRL-SCIDec 5, 2020
Dataset of Random Relaxations for Crystal Structure Search of Li-Si SystemGowoon Cheon, Lusann Yang, Kevin McCloskey et al.
Crystal structure search is a long-standing challenge in materials design. We present a dataset of more than 100,000 structural relaxations of potential battery anode materials from randomized structures using density functional theory calculations. We illustrate the usage of the dataset by training graph neural networks to predict structural relaxations from randomly generated structures. Our models directly predict stresses in addition to forces, which allows them to accurately simulate relaxations of both ionic positions and lattice vectors. We show that models trained on the molecular dynamics simulations fail to simulate relaxations from random structures, while training on our data leads to up to two orders of magnitude decrease in error for the same task. Our model is able to find an experimentally verified structure of a stoichiometry held out from training. We find that randomly perturbing atomic positions during training improves both the accuracy and out of domain generalization of the models.