Sophie Ostmeier

CV
h-index21
16papers
337citations
Novelty50%
AI Score62

16 Papers

IVSep 26, 2022Code
USE-Evaluator: Performance Metrics for Medical Image Segmentation Models with Uncertain, Small or Empty Reference Annotations

Sophie Ostmeier, Brian Axelrod, Jeroen Bertels et al.

Performance metrics for medical image segmentation models are used to measure the agreement between the reference annotation and the predicted segmentation. Usually, overlap metrics, such as the Dice, are used as a metric to evaluate the performance of these models in order for results to be comparable. However, there is a mismatch between the distributions of cases and difficulty level of segmentation tasks in public data sets compared to clinical practice. Common metrics fail to measure the impact of this mismatch, especially for clinical data sets that include low signal pathologies, a difficult segmentation task, and uncertain, small, or empty reference annotations. This limitation may result in ineffective research of machine learning practitioners in designing and optimizing models. Dimensions of evaluating clinical value include consideration of the uncertainty of reference annotations, independence from reference annotation volume size, and evaluation of classification of empty reference annotations. We study how uncertain, small, and empty reference annotations influence the value of metrics for medical image segmentation on an in-house data set regardless of the model. We examine metrics behavior on the predictions of a standard deep learning framework in order to identify metrics with clinical value. We compare to a public benchmark data set (BraTS 2019) with a high-signal pathology and certain, larger, and no empty reference annotations. We may show machine learning practitioners, how uncertain, small, or empty reference annotations require a rethinking of the evaluation and optimizing procedures. The evaluation code was released to encourage further analysis of this topic. https://github.com/SophieOstmeier/UncertainSmallEmpty.git

CVMar 26Code
Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)

Gustavo Chau Loo Kung, Mohammad Abbasi, Camila Blank et al.

Diffusion Magnetic Resonance Imaging (dMRI) plays a critical role in studying microstructural changes in the brain. It is, therefore, widely used in clinical practice; yet progress in learning general-purpose representations from dMRI has been limited. A key challenge is that existing deep learning approaches are not well-suited to capture the unique properties of diffusion signals. Brain dMRI is normally composed of several brain volumes, each with different attenuation characteristics dependent on the direction and strength of the diffusion-sensitized gradients. Thus, there is a need to jointly model spatial, diffusion-weighting, and directional dependencies in dMRI. Furthermore, varying acquisition protocols (e.g., differing numbers of directions) further limit traditional models. To address these gaps, we introduce a diffusion space rotatory positional embedding (D-RoPE) plugged into our dMRI transformer to capture both the spatial structure and directional characteristics of diffusion data, enabling robust and transferable representations across diverse acquisition settings and an arbitrary number of diffusion directions. After self-supervised masked autoencoding pretraining, tests on several downstream tasks show that the learned representations and the pretrained model can provide competitive or superior performance compared to several baselines in these downstream tasks (even compared to a fully trained baseline); the finetuned features from our pretrained encoder resulted in a 6% higher accuracy in classifying mild cognitive impairment and a 0.05 increase in the correlation coefficient when predicting cognitive scores. Code is available at: github.com/gustavochau/D-RoPE.

IVNov 24, 2022
Non-inferiority of Deep Learning Acute Ischemic Stroke Segmentation on Non-Contrast CT Compared to Expert Neuroradiologists

Sophie Ostmeier, Brian Axelrod, Benjamin F. J. Verhaaren et al.

To determine if a convolutional neural network (CNN) deep learning model can accurately segment acute ischemic changes on non-contrast CT compared to neuroradiologists. Non-contrast CT (NCCT) examinations from 232 acute ischemic stroke patients who were enrolled in the DEFUSE 3 trial were included in this study. Three experienced neuroradiologists independently segmented hypodensity that reflected the ischemic core on each scan. The neuroradiologist with the most experience (expert A) served as the ground truth for deep learning model training. Two additional neuroradiologists (experts B and C) segmentations were used for data testing. The 232 studies were randomly split into training and test sets. The training set was further randomly divided into 5 folds with training and validation sets. A 3-dimensional CNN architecture was trained and optimized to predict the segmentations of expert A from NCCT. The performance of the model was assessed using a set of volume, overlap, and distance metrics using non-inferiority thresholds of 20%, 3ml, and 3mm. The optimized model trained on expert A was compared to test experts B and C. We used a one-sided Wilcoxon signed-rank test to test for the non-inferiority of the model-expert compared to the inter-expert agreement. The final model performance for the ischemic core segmentation task reached a performance of 0.46+-0.09 Surface Dice at Tolerance 5mm and 0.47+-0.13 Dice when trained on expert A. Compared to the two test neuroradiologists the model-expert agreement was non-inferior to the inter-expert agreement, p < 0.05. The CNN accurately delineates the hypodense ischemic core on NCCT in acute ischemic stroke patients with an accuracy comparable to neuroradiologists.

CVSep 7, 2023
Random Expert Sampling for Deep Learning Segmentation of Acute Ischemic Stroke on Non-contrast CT

Sophie Ostmeier, Brian Axelrod, Benjamin Pulli et al.

Purpose: Multi-expert deep learning training methods to automatically quantify ischemic brain tissue on Non-Contrast CT Materials and Methods: The data set consisted of 260 Non-Contrast CTs from 233 patients of acute ischemic stroke patients recruited in the DEFUSE 3 trial. A benchmark U-Net was trained on the reference annotations of three experienced neuroradiologists to segment ischemic brain tissue using majority vote and random expert sampling training schemes. We used a one-sided Wilcoxon signed-rank test on a set of segmentation metrics to compare bootstrapped point estimates of the training schemes with the inter-expert agreement and ratio of variance for consistency analysis. We further compare volumes with the 24h-follow-up DWI (final infarct core) in the patient subgroup with full reperfusion and we test volumes for correlation to the clinical outcome (mRS after 30 and 90 days) with the Spearman method. Results: Random expert sampling leads to a model that shows better agreement with experts than experts agree among themselves and better agreement than the agreement between experts and a majority-vote model performance (Surface Dice at Tolerance 5mm improvement of 61% to 0.70 +- 0.03 and Dice improvement of 25% to 0.50 +- 0.04). The model-based predicted volume similarly estimated the final infarct volume and correlated better to the clinical outcome than CT perfusion. Conclusion: A model trained on random expert sampling can identify the presence and location of acute ischemic brain tissue on Non-Contrast CT similar to CT perfusion and with better consistency than experts. This may further secure the selection of patients eligible for endovascular treatment in less specialized hospitals.

CVNov 30, 2025Code
TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models

Maya Varma, Jean-Benoit Delbrouck, Sophie Ostmeier et al.

Vision-language models (VLMs) have made great strides in addressing temporal understanding tasks, which involve characterizing visual changes across a sequence of images. However, recent works have suggested that when making predictions, VLMs may rely on static feature biases, such as background or object features, rather than dynamic visual changes. Static feature biases are a type of shortcut and can contribute to systematic prediction errors on downstream tasks; as a result, identifying and characterizing error-inducing static feature biases is critical prior to real-world model deployment. In this work, we introduce TRoVe, an automated approach for discovering error-inducing static feature biases learned by temporal VLMs. Given a trained VLM and an annotated validation dataset associated with a downstream classification task, TRoVe extracts candidate static features from the dataset and scores each feature by (i) the effect of the feature on classification errors as well as (ii) the extent to which the VLM relies on the feature when making predictions. In order to quantitatively evaluate TRoVe, we introduce an evaluation framework consisting of 101 trained temporal VLMs paired with ground-truth annotations for learned static feature biases. We use this framework to demonstrate that TRoVe can accurately identify error-inducing static feature biases in VLMs, achieving a 28.6% improvement over the closest baseline. Finally, we apply TRoVe to 7 off-the-shelf VLMs and 2 temporal understanding tasks, surfacing previously-unknown static feature biases and demonstrating that knowledge of learned biases can aid in improving model performance at test time. Our code is available at https://github.com/Stanford-AIMI/TRoVe.

CVApr 1
A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

Yabin Zhang, Chong Wang, Yunhe Gao et al.

Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.

IVFeb 20, 2025Code
MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders

Maya Varma, Ashwin Kumar, Rogier van der Sluijs et al.

Medical images are acquired at high resolutions with large fields of view in order to capture fine-grained features necessary for clinical decision-making. Consequently, training deep learning models on medical images can incur large computational costs. In this work, we address the challenge of downsizing medical images in order to improve downstream computational efficiency while preserving clinically-relevant features. We introduce MedVAE, a family of six large-scale 2D and 3D autoencoders capable of encoding medical images as downsized latent representations and decoding latent representations back to high-resolution images. We train MedVAE autoencoders using a novel two-stage training approach with 1,052,730 medical images. Across diverse tasks obtained from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent representations in place of high-resolution images when training downstream models can lead to efficiency benefits (up to 70x improvement in throughput) while simultaneously preserving clinically-relevant features and (2) MedVAE can decode latent representations back to high-resolution images with high fidelity. Our work demonstrates that large-scale, generalizable autoencoders can help address critical efficiency challenges in the medical domain. Our code is available at https://github.com/StanfordMIMI/MedVAE.

HCMar 14
Clinician input steers frontier AI models toward both accurate and harmful decisions

Ivan Lopez, Selin S. Everett, Bryan J. Bunning et al.

Large language models (LLMs) are entering clinician workflows, yet evaluations rarely measure how clinician reasoning shapes model behavior during clinical interactions. We combined 61 New England Journal of Medicine Case Records with 92 real-world clinician-AI interactions to evaluate 21 reasoning LLM variants across 8 frontier models on differential diagnosis generation and next step recommendations under three conditions: reasoning alone, after expert clinician context, and after adversarial clinician context. LLM-clinician concordance increased substantially after clinician exposure, with simulations sharing >=3 differential diagnosis items rising from 65.8% to 93.5% and >=3 next step recommendations from 20.3% to 53.8%. Expert context significantly improved correct final diagnosis inclusion across all 21 models (mean +20.4 percentage points), reflecting both reasoning improvement and passive content echoing, while adversarial context caused significant diagnostic degradation in 14 models (mean -5.4 percentage points). Multi-turn disagreement probes revealed distinct model phenotypes ranging from highly conformist to dogmatic, with adversarial arguments remaining a persistent vulnerability even for otherwise resilient models. Inference-time scaling reduced harmful echoing of clinician-introduced recommendations across WHO-defined harm severity tiers (relative reductions: 62.7% mild, 57.9% moderate, 76.3% severe, 83.5% death-tier). In GPT-4o experiments, explicit clinician uncertainty signals improved diagnostic performance after adversarial context (final diagnosis inclusion 27% to 42%) and reduced alignment with incorrect arguments by 21%. These findings establish a foundation for evaluating clinician-AI collaboration, introducing interactive metrics and mitigation strategies essential for safety and robustness.

CVMay 11
CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography

Eva Prakash, Yunhe Gao, Chong Wang et al.

Chest radiograph interpretation requires temporal reasoning over prior and current studies, yet most vision-language models are trained on static image-report pairs and lack explicit supervision for modeling longitudinal change. We introduce CheXTemporal, a dataset for temporally grounded reasoning in chest radiography consisting of paired prior-current chest X-rays (CXR) with finding-level temporal and spatial annotations. The dataset includes a five-class progression taxonomy (new, worse, stable, improved, resolved), localized spatial supervision of pathology, explicit spatial-temporal alignment across paired studies, and multi-source coverage for cross-domain evaluation. We additionally construct a 280K-pair silver dataset with automatically derived temporal and anatomical supervision for large-scale evaluation under weaker supervision. Using these resources, we evaluate multiple state-of-the-art vision-language CXR models on grounding and progression-classification tasks in a zero-shot setting. Across both gold and silver evaluations, current models exhibit consistent limitations in spatial grounding, fine-grained temporal reasoning, and robustness under distribution shift. In particular, models perform substantially better on salient progression categories such as worse than on temporally subtle states such as stable and resolved, suggesting limited modeling of longitudinal disease evolution in chest radiography.

CLJul 3, 2025Code
MedVAL: Toward Expert-Level Medical Text Validation with Language Models

Asad Aali, Vasiliki Bikia, Maya Varma et al. · stanford

With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (a LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs across 6 diverse medical tasks capturing real-world challenges. Across 10 state-of-the-art LMs spanning open-source and proprietary models, MedVAL distillation significantly improves (p < 0.001) alignment with physicians across seen and unseen tasks, increasing average F1 scores from 66% to 83%. Despite strong baseline performance, MedVAL improves the best-performing proprietary LM (GPT-4o) by 8% without training on physician-labeled data, demonstrating a performance statistically non-inferior to a single human expert (p < 0.001). To support a scalable, risk-aware pathway towards clinical integration, we open-source: 1) Codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B). Our benchmark provides evidence of LMs approaching expert-level ability in validating AI-generated medical text.

CLMay 30, 2025Code
Structuring Radiology Reports: Challenging LLMs with Lightweight Models

Johannes Moll, Louisa Fay, Asfandyar Azhar et al.

Radiology reports are critical for clinical decision-making but often lack a standardized format, limiting both human interpretability and machine learning (ML) applications. While large language models (LLMs) have shown strong capabilities in reformatting clinical text, their high computational requirements, lack of transparency, and data privacy concerns hinder practical deployment. To address these challenges, we explore lightweight encoder-decoder models (<300M parameters)-specifically T5 and BERT2BERT-for structuring radiology reports from the MIMIC-CXR and CheXpert Plus datasets. We benchmark these models against eight open-source LLMs (1B-70B), adapted using prefix prompting, in-context learning (ICL), and low-rank adaptation (LoRA) finetuning. Our best-performing lightweight model outperforms all LLMs adapted using prompt-based techniques on a human-annotated test set. While some LoRA-finetuned LLMs achieve modest gains over the lightweight model on the Findings section (BLEU 6.4%, ROUGE-L 4.8%, BERTScore 3.6%, F1-RadGraph 1.1%, GREEN 3.6%, and F1-SRR-BERT 4.3%), these improvements come at the cost of substantially greater computational resources. For example, LLaMA-3-70B incurred more than 400 times the inference time, cost, and carbon emissions compared to the lightweight model. These results underscore the potential of lightweight, task-specific models as sustainable and privacy-preserving solutions for structuring clinical text in resource-constrained healthcare settings.

CVJun 14, 2024Code
LieRE: Lie Rotational Positional Encodings

Sophie Ostmeier, Brian Axelrod, Maya Varma et al.

Transformer architectures rely on position encodings to model the spatial structure of input data. Rotary Position Encoding (RoPE) is a widely used method in language models that encodes relative positions through fixed, block-diagonal, rotation matrices applied to key-query interactions. We hypothesize that this inductive bias limits their RoPE's effectiveness for modalities with high dimensional structure. Lie Relative Encodings (LieRE) introduce a principled generalization of RoPE, aimed at increasing the representational capacity of positional encodings in transformers. Instead of fixed 2D rotations, LieRE learns dense skew-symmetric matrices (Lie algebra elements), which are then differentiable mapped to form high-dimensional rotation matrices (Lie group elements). This results in richer, learnable, and continuous, encodings of both relative and absolute positional information. We demonstrate the effectiveness of LieRE on 2D and 3D vision tasks, showing that it generalizes well to higher input resolutions while maintaining computational efficiency. The code and checkpoints are publicly available at https://github.com/StanfordMIMI/LieRE.

CVJun 10, 2024Code
Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset

Louis Blankemeier, Ashwin Kumar, Joseph Paul Cohen et al.

The large volume of abdominal computed tomography (CT) scans coupled with the shortage of radiologists have intensified the need for automated medical image analysis tools. Previous state-of-the-art approaches for automated analysis leverage vision-language models (VLMs) that jointly model images and radiology reports. However, current medical VLMs are generally limited to 2D images and short reports. Here to overcome these shortcomings for abdominal CT interpretation, we introduce Merlin, a 3D VLM that learns from volumetric CT scans, electronic health record data and radiology reports. This approach is enabled by a multistage pretraining framework that does not require additional manual annotations. We trained Merlin using a high-quality clinical dataset of paired CT scans (>6 million images from 15,331 CT scans), diagnosis codes (>1.8 million codes) and radiology reports (>6 million tokens). We comprehensively evaluated Merlin on 6 task types and 752 individual tasks that covered diagnostic, prognostic and quality-related tasks. The non-adapted (off-the-shelf) tasks included zero-shot classification of findings (30 findings), phenotype classification (692 phenotypes) and zero-shot cross-modal retrieval (image-to-findings and image-to-impression). The model-adapted tasks included 5-year chronic disease prediction (6 diseases), radiology report generation and 3D semantic segmentation (20 organs). We validated Merlin at scale, with internal testing on 5,137 CT scans and external testing on 44,098 CT scans from 3 independent sites and 2 public datasets. The results demonstrated high generalization across institutions and anatomies. Merlin outperformed 2D VLMs, CT foundation models and off-the-shelf radiology models. We also release our trained models, code, and dataset, available at: https://github.com/StanfordMIMI/Merlin.

CLMay 6, 2024Code
GREEN: Generative Radiology Report Evaluation and Error Notation

Sophie Ostmeier, Justin Xu, Zhihong Chen et al.

Evaluating radiology reports is a challenging problem as factual correctness is extremely important due to the need for accurate medical communication about medical images. Existing automatic evaluation metrics either suffer from failing to consider factual correctness (e.g., BLEU and ROUGE) or are limited in their interpretability (e.g., F1CheXpert and F1RadGraph). In this paper, we introduce GREEN (Generative Radiology Report Evaluation and Error Notation), a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports, both quantitatively and qualitatively. Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight open-source method that reaches the performance of commercial counterparts. We validate our GREEN metric by comparing it to GPT-4, as well as to error counts of 6 experts and preferences of 2 experts. Our method demonstrates not only higher correlation with expert error counts, but simultaneously higher alignment with expert preferences when compared to previous approaches.

CLMay 30, 2025
Automated Structured Radiology Report Generation

Jean-Benoit Delbrouck, Justin Xu, Johannes Moll et al. · oxford, stanford

Automated radiology report generation from chest X-ray (CXR) images has the potential to improve clinical efficiency and reduce radiologists' workload. However, most datasets, including the publicly available MIMIC-CXR and CheXpert Plus, consist entirely of free-form reports, which are inherently variable and unstructured. This variability poses challenges for both generation and evaluation: existing models struggle to produce consistent, clinically meaningful reports, and standard evaluation metrics fail to capture the nuances of radiological interpretation. To address this, we introduce Structured Radiology Report Generation (SRRG), a new task that reformulates free-text radiology reports into a standardized format, ensuring clarity, consistency, and structured clinical reporting. We create a novel dataset by restructuring reports using large language models (LLMs) following strict structured reporting desiderata. Additionally, we introduce SRR-BERT, a fine-grained disease classification model trained on 55 labels, enabling more precise and clinically informed evaluation of structured reports. To assess report quality, we propose F1-SRR-BERT, a metric that leverages SRR-BERT's hierarchical disease taxonomy to bridge the gap between free-text variability and structured clinical reporting. We validate our dataset through a reader study conducted by five board-certified radiologists and extensive benchmarking experiments.

CVOct 12, 2025
From Detection to Mitigation: Addressing Bias in Deep Learning Models for Chest X-Ray Diagnosis

Clemence Mottez, Louisa Fay, Maya Varma et al.

Deep learning models have shown promise in improving diagnostic accuracy from chest X-rays, but they also risk perpetuating healthcare disparities when performance varies across demographic groups. In this work, we present a comprehensive bias detection and mitigation framework targeting sex, age, and race-based disparities when performing diagnostic tasks with chest X-rays. We extend a recent CNN-XGBoost pipeline to support multi-label classification and evaluate its performance across four medical conditions. We show that replacing the final layer of CNN with an eXtreme Gradient Boosting classifier improves the fairness of the subgroup while maintaining or improving the overall predictive performance. To validate its generalizability, we apply the method to different backbones, namely DenseNet-121 and ResNet-50, and achieve similarly strong performance and fairness outcomes, confirming its model-agnostic design. We further compare this lightweight adapter training method with traditional full-model training bias mitigation techniques, including adversarial training, reweighting, data augmentation, and active learning, and find that our approach offers competitive or superior bias reduction at a fraction of the computational cost. Finally, we show that combining eXtreme Gradient Boosting retraining with active learning yields the largest reduction in bias across all demographic subgroups, both in and out of distribution on the CheXpert and MIMIC datasets, establishing a practical and effective path toward equitable deep learning deployment in clinical radiology.