CY Apr 30
Adoption and Use of LLMs at an Academic Medical Center
Nigam H. Shah, Nerissa Ambers, Abby Pandya et al.
While large language models (LLMs) can support clinical documentation needs, standalone tools struggle with "workflow friction" from manual data entry. We developed ChatEHR, a system that enables the use of LLMs with the entire patient timeline spanning several years. ChatEHR enables automations, which are static combinations of prompts and data that perform a fixed task, and interactive use in the electronic health record (EHR) via a user interface (UI). The resulting ability to sift through patient medical records for diverse use cases such as pre-visit chart review, screening for transfer eligibility, monitoring for surgical site infections, and chart abstraction redefines LLM use as an institutional capability. This system, accessible after user training, enables continuous monitoring and evaluation of LLM use. In 1.5 years, we built 7 automations and 1,075 users have trained to become routine users of the UI, engaging in 23,000 sessions in the first 3 months after launch. For automations, being model-agnostic and accessing multiple types of data was essential for matching specific clinical or administrative tasks with the most appropriate LLM. Benchmark-based evaluations proved insufficient for monitoring and evaluation of the UI, requiring new methods to monitor performance. Generation of summaries was the most frequent task in the UI, with an estimated 0.73 hallucinations and 1.60 inaccuracies per generation. The resulting mix of cost savings, time savings, and revenue growth required a value assessment framework to prioritize work as well as quantify the impact of using LLMs. Initial estimates are $6M savings in the first year of use, without quantifying the benefit of the better care offered. Such a "build-from-within" strategy provides an opportunity for health systems to maintain agency via a vendor-agnostic, internally governed LLM platform.
LG May 22
Private Adaptive Covariance Estimation via Gaussian Graphical Models
Cecilia Ferrando, Miguel Fuentes, Brett Mullins et al.
We propose PACE-GGM, a data-adaptive differentially private method for covariance estimation that concentrates its privacy budget on the most informative entries of the empirical covariance matrix, rather than perturbing all entries. This applies in the natural setting where the modeler supplies separate bounds for each variable, so that individual entries can be measured with less noise than the full matrix. In each round, our method selects a poorly approximated entry, measures it using the Gaussian mechanism, and then reconstructs a full covariance matrix using a maximum-entropy reconstruction objective, leading to a Gaussian graphical model structure. Experiments on diverse real-world datasets demonstrate consistent improvements in estimation error with respect to the Gaussian mechanism and other baselines, particularly in high-dimensional and low-to-moderate privacy regimes.
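The measurement step described above can be illustrated with a minimal sketch. This is not the PACE-GGM implementation: it assumes the measured quantity is an uncentered sum of products, that each variable i is clipped to [-bounds[i], bounds[i]], that neighboring datasets differ by adding or removing one record, and that noise is set by the classical Gaussian-mechanism calibration (valid for 0 < epsilon < 1). All function names are hypothetical.

```python
import math
import random

def gaussian_sigma(sensitivity, epsilon, delta):
    # Classical Gaussian-mechanism calibration; assumes 0 < epsilon < 1.
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def clip(value, bound):
    # Enforce the modeler-supplied per-variable bound.
    return max(-bound, min(bound, value))

def entry_sensitivity(bounds, i, j):
    # One record changes sum(x_i * x_j) by at most bounds[i] * bounds[j].
    return bounds[i] * bounds[j]

def full_matrix_sensitivity(bounds):
    # One record changes sum(x x^T) by at most ||x||^2 in Frobenius norm.
    return sum(b * b for b in bounds)

def noisy_entry(data, i, j, bounds, epsilon, delta, rng):
    # Measure a single (i, j) entry with Gaussian noise (hypothetical helper).
    total = sum(clip(row[i], bounds[i]) * clip(row[j], bounds[j]) for row in data)
    sigma = gaussian_sigma(entry_sensitivity(bounds, i, j), epsilon, delta)
    return total + rng.gauss(0.0, sigma), sigma

data = [[0.5, -1.0, 2.0], [1.5, 0.3, -2.5], [-0.2, 1.8, 0.4]]
bounds = [1.0, 2.0, 3.0]
estimate, sigma = noisy_entry(data, 0, 1, bounds, 0.5, 1e-5, random.Random(0))
```

With these bounds, the single entry (0, 1) has sensitivity 2.0 while the full matrix has sensitivity 14.0, which is the sense in which one entry can be measured with less noise than the whole matrix.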
OT Dec 9, 2025
Monitoring Deployed AI Systems in Health Care
Timothy Keyes, Alison Callahan, Abby S. Pandya et al.
Post-deployment monitoring of artificial intelligence (AI) systems in health care is essential to ensure their safety, quality, and sustained benefit, and to support governance decisions about which systems to update, modify, or decommission. Motivated by these needs, we developed a framework for monitoring deployed AI systems grounded in the mandate to take specific actions when they fail to behave as intended. This framework, which is now actively used at Stanford Health Care, is organized around three complementary principles: system integrity, performance, and impact. System integrity monitoring focuses on maximizing system uptime, detecting runtime errors, and identifying when changes to the surrounding IT ecosystem have unintended effects. Performance monitoring focuses on maintaining accurate system behavior in the face of changing health care practices (and thus input data) over time. Impact monitoring assesses whether a deployed system continues to have value in the form of benefit to clinicians and patients. Drawing on examples of deployed AI systems at our academic medical center, we provide practical guidance for creating monitoring plans based on these principles that specify which metrics to measure, when those metrics should be reviewed, who is responsible for acting when metrics change, and what concrete follow-up actions should be taken, for both traditional and generative AI. We also discuss challenges to implementing this framework, including the effort and cost of monitoring for health systems with limited resources and the difficulty of incorporating data-driven monitoring practices into complex organizations where conflicting priorities and definitions of success often coexist. This framework offers a practical template and starting point for health systems seeking to ensure that AI deployments remain safe and effective over time.
CL Apr 28
Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models
Sasha Ronaghi, Chloe Stanwyck, Asad Aali et al.
Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model's reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6% over UniTE, +41.4% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity. This technique especially benefits healthcare institutions with constrained computational capacity that cannot support iterative clinical training and want to adopt emerging general-domain model advances.
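For orientation, the logit-offset idea behind plain proxy tuning (one of the baselines named above) can be sketched as follows. This is not CAPT: it assumes all three models share a single vocabulary, which is exactly the restriction CAPT is described as lifting. The `alpha` weight, the toy vocabulary, and the logit values are illustrative assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_logits(base, expert, anti_expert, alpha=1.0):
    # Shift the large model's logits by the difference between a small
    # tuned model (expert) and its untuned counterpart (anti-expert).
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]

# Toy next-token logits over a shared three-token vocabulary (made-up values).
vocab = ["the", "patient", "metoprolol"]
base = [2.0, 1.0, 0.0]          # large general-domain model
expert = [0.0, 0.5, 2.0]        # small clinical model
anti_expert = [0.0, 0.5, 0.0]   # small model before clinical tuning

combined = proxy_tuned_logits(base, expert, anti_expert)
probs = softmax(combined)
```

In this toy case the clinical token gains probability mass only where the expert and anti-expert disagree; tokens on which they agree keep the general-domain model's logits unchanged.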
CL Nov 25, 2025
Structured Prompting Enables More Robust Evaluation of Language Models
Asad Aali, Muhammad Ahmed Mohsin, Vasiliki Bikia et al.
As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we approximate each LM's ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing chain-of-thought reduces LM sensitivity to prompt design (smaller Δ across prompts). To our knowledge, this is the first benchmarking study to systematically integrate structured prompting into an established evaluation framework, demonstrating how scalable performance-ceiling approximation yields more robust, decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).
LG Mar 12, 2024
Joint Selection: Adaptively Incorporating Public Information for Private Synthetic Data
Miguel Fuentes, Brett Mullins, Ryan McKenna et al.
Mechanisms for generating differentially private synthetic data based on marginals and graphical models have been successful in a wide range of settings. However, one limitation of these methods is their inability to incorporate public data. Initializing a data-generating model by pre-training on public data has been shown to improve the quality of synthetic data, but this technique is not applicable when the model structure is not determined a priori. We develop the mechanism jam-pgm, which expands the adaptive measurements framework to jointly select between measuring public data and private data. This technique allows public data to be included in a graphical-model-based mechanism. We show that jam-pgm is able to outperform both publicly assisted and non-publicly assisted synthetic data generation mechanisms, even when the public data distribution is biased.
SOC-PH Sep 23, 2014
Does network complexity help organize Babel's library?
Juan Pablo Cárdenas, Iván González, Gerardo Vidal et al.
In this work, we study properties of texts from the perspective of complex network theory. Words in given texts are linked by co-occurrence and transformed into networks, and we observe that these display topological properties common to other complex systems. However, there are some properties that seem to be exclusive to texts; many of these properties depend on the frequency of words in the text, while others seem to be strictly determined by the grammar. It is precisely these properties that allow texts to be categorized as either meaningful or as encoded or senseless.
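A minimal sketch of how a text can be turned into a word co-occurrence network follows. The abstract does not specify the linking rule, so this assumes whitespace tokenization, lowercasing, and an undirected link between words that appear within `window` positions of each other; the function names are hypothetical.

```python
from collections import defaultdict

def cooccurrence_network(text, window=1):
    # Link each word to the words that follow it within `window` positions.
    words = text.lower().split()
    graph = defaultdict(set)
    for idx, word in enumerate(words):
        for other in words[idx + 1 : idx + 1 + window]:
            if other != word:  # skip self-loops from immediate repeats
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)

def degrees(graph):
    # Node degree is one of the simplest topological properties to inspect.
    return {word: len(neighbors) for word, neighbors in graph.items()}

network = cooccurrence_network("the cat saw the dog")
degree_by_word = degrees(network)
```

Here the frequent word "the" ends up with the highest degree, a small instance of how network properties can depend on word frequency.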