QMNov 20, 2022
Karyotype AI for Precision OncologyZahra Shamsi, Isaac Reid, Drew Bryant et al.
We present a machine learning method capable of accurately detecting chromosome abnormalities that cause blood cancers directly from microscope images of the metaphase stage of cell division. The pipeline is built on a series of fine-tuned Vision Transformers. Current state of the art (and standard clinical practice) requires expensive, manual expert analysis, whereas our pipeline takes only 15 seconds per metaphase image. Using a novel pretraining-finetuning strategy to mitigate the challenge of data scarcity, we achieve a high precision-recall score of 94% AUC for the clinically significant del(5q) and t(9;22) anomalies. Our method also unlocks zero-shot detection of rare aberrations based on model latent embeddings. The ability to quickly, accurately, and scalably diagnose genetic abnormalities directly from metaphase images could transform karyotyping practice and improve patient outcomes. We will make code publicly available.
CLMar 14, 2025Code
CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and ReasoningHao Cui, Zahra Shamsi, Gowoon Cheon et al.
Scientific problem-solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long-Context Understanding,Reasoning and Information Extraction benchmark to measure the potential of Large Language Models (LLMs) in scientific problem-solving and assisting scientists in realistic workflows. This benchmark introduces ten challenging tasks with a total of 580 problems and solution pairs curated by experts in six disciplines - materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins - covering both experimental and theoretical work-flows in science. We evaluate a range of closed and open LLMs on tasks in CURIE which requires domain expertise, comprehension of long in-context information,and multi-step reasoning. While Gemini Flash 2.0 and Claude-3 show consistent high comprehension across domains, the popular GPT-4o and command-R+ fail dramatically on protein sequencing tasks. With the best performance at 32% there is much room for improvement for all models. We hope that insights gained from CURIE can guide the future development of LLMs in sciences. Evaluation code and data are in https://github.com/google/curie
87.2AIMay 15
Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree searchSarah Martinson, Michael P. Brenner, Martyna Plomecka et al.
Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce "cold start" scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.
AISep 8, 2025
An AI system to help scientists write expert-level empirical softwareEser Aygün, Anastasiya Belyaeva, Gheorghe Comanici et al.
The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments. To address this, we present an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS) to systematically improve the quality metric and intelligently navigate the large space of possible solutions. The system achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a wide range of benchmarks. In bioinformatics, it discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, it generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. Our method also produced state-of-the-art software for geospatial analysis, neural activity prediction in zebrafish, time series forecasting and numerical solution of integrals. By devising and implementing novel solutions to diverse tasks, the system represents a significant step towards accelerating scientific progress.