Bing Ren

IV
h-index: 49
13 papers
800 citations
Novelty: 33%
AI Score: 33

13 Papers

CL · Nov 14, 2024 · Code
A Benchmark for Long-Form Medical Question Answering

Pedram Hosseini, Jessica M. Sin, Bing Ren et al.

There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA). Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions. While valuable, these benchmarks fail to fully capture or assess the complexities of real-world clinical applications where LLMs are being deployed. Furthermore, existing studies on evaluating long-form answer generation in medical QA are primarily closed-source, lacking access to human medical expert annotations, which makes it difficult to reproduce results and enhance existing baselines. In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors. We performed pairwise comparisons of responses from various open and closed-source medical and general-purpose LLMs based on criteria such as correctness, helpfulness, harmfulness, and bias. Additionally, we performed a comprehensive LLM-as-a-judge analysis to study the alignment between human judgments and LLMs. Our preliminary results highlight the strong potential of open LLMs in medical QA compared to leading closed models. Code & Data: https://github.com/lavita-ai/medical-eval-sphere

CL · Feb 20, 2025
ALFA: Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning

Shuyue Stella Li, Jimin Mun, Faeze Brahman et al. · allen-ai, cmu

Large language models (LLMs) often fail to ask effective questions under uncertainty, making them unreliable in domains where proactive information-gathering is essential for decision-making. We present ALignment via Fine-grained Attributes (ALFA), a framework that improves LLM question-asking by (i) decomposing the notion of a "good" question into a set of theory-grounded attributes (e.g., clarity, relevance), (ii) controllably synthesizing attribute-specific question variations, and (iii) aligning models via preference-based optimization to explicitly learn to ask better questions along these fine-grained attributes. Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs dataset, composed of 17k real-world clinical interactions augmented with 80k attribute-specific preference pairs of follow-up questions, as well as a novel expert-annotated interactive healthcare QA task to evaluate question-asking abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on MediQ-AskDocs compared to SoTA instruction-tuned LLMs, with a question-level win-rate of 64.4% and strong generalizability. Our findings suggest that explicitly guiding question-asking with structured, fine-grained attributes offers a scalable path to improve LLMs, especially in expert application domains.

CV · Oct 25, 2024
Deep Learning for Classification of Inflammatory Bowel Disease Activity in Whole Slide Images of Colonic Histopathology

Amit Das, Tanmay Shukla, Naofumi Tomita et al.

Grading inflammatory bowel disease (IBD) activity using standardized histopathological scoring systems remains challenging due to resource constraints and inter-observer variability. In this study, we developed a deep learning model to classify activity grades in hematoxylin and eosin-stained whole slide images (WSIs) from patients with IBD, offering a robust approach for general pathologists. We utilized 2,077 WSIs from 636 patients treated at Dartmouth-Hitchcock Medical Center in 2018 and 2019, scanned at 40x magnification (0.25 micron/pixel). Board-certified gastrointestinal pathologists categorized the WSIs into four activity classes: inactive, mildly active, moderately active, and severely active. A transformer-based model was developed and validated using five-fold cross-validation to classify IBD activity. Using HoVerNet, we examined neutrophil distribution across activity grades. Attention maps from our model highlighted areas contributing to its prediction. The model classified IBD activity with weighted averages of 0.871 [95% Confidence Interval (CI): 0.860-0.883] for the area under the curve, 0.695 [95% CI: 0.674-0.715] for precision, 0.697 [95% CI: 0.678-0.716] for recall, and 0.695 [95% CI: 0.674-0.714] for F1-score. Neutrophil distribution was significantly different across activity classes. Qualitative evaluation of attention maps by a gastrointestinal pathologist suggested their potential for improved interpretability. Our model demonstrates robust diagnostic performance and could enhance consistency and efficiency in IBD activity assessment.

IV · Jun 18, 2025
Cross-Modality Learning for Predicting IHC Biomarkers from H&E-Stained Whole-Slide Images

Amit Das, Naofumi Tomita, Kyle J. Syme et al.

Hematoxylin and Eosin (H&E) staining is a cornerstone of pathological analysis, offering reliable visualization of cellular morphology and tissue architecture for cancer diagnosis, subtyping, and grading. Immunohistochemistry (IHC) staining provides molecular insights by detecting specific proteins within tissues, enhancing diagnostic accuracy and improving treatment planning. However, IHC staining is costly, time-consuming, and resource-intensive, requiring specialized expertise. To address these limitations, this study proposes HistoStainAlign, a novel deep learning framework that predicts IHC staining patterns directly from H&E whole-slide images (WSIs) by learning joint representations of morphological and molecular features. The framework integrates paired H&E and IHC embeddings through a contrastive training strategy, capturing complementary features across staining modalities without patch-level annotations or tissue registration. The model was evaluated on gastrointestinal and lung tissue WSIs with three commonly used IHC stains: P53, PD-L1, and Ki-67. HistoStainAlign achieved weighted F1 scores of 0.735 [95% Confidence Interval (CI): 0.670-0.799], 0.830 [95% CI: 0.772-0.886], and 0.723 [95% CI: 0.607-0.836], respectively, for these three IHC stains. Embedding analyses demonstrated the robustness of the contrastive alignment in capturing meaningful cross-stain relationships. Comparisons with a baseline model further highlight the advantage of incorporating contrastive learning for improved stain pattern prediction. This study demonstrates the potential of computational approaches to serve as a pre-screening tool, helping prioritize cases for IHC staining and improving workflow efficiency.

IV · Jan 29, 2021
A Petri Dish for Histopathology Image Analysis

Jerry Wei, Arief Suriawinata, Bing Ren et al.

With the rise of deep learning, there has been increased interest in using neural networks for histopathology image analysis, a field that investigates the properties of biopsy or resected specimens traditionally manually examined under a microscope by pathologists. However, challenges such as limited data, costly annotation, and processing high-resolution and variable-size images make it difficult to quickly iterate over model designs. Throughout scientific history, many significant research directions have leveraged small-scale experimental setups as petri dishes to efficiently evaluate exploratory ideas. In this paper, we introduce a minimalist histopathology image analysis dataset (MHIST), an analogous petri dish for histopathology image analysis. MHIST is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each with a gold-standard label determined by the majority vote of seven board-certified gastrointestinal pathologists, along with the annotator agreement level. MHIST occupies less than 400 MB of disk space, and a ResNet-18 baseline can be trained to convergence on MHIST in just 6 minutes using 3.5 GB of memory on an NVIDIA RTX 3090. As example use cases, we use MHIST to study natural questions such as how dataset size, network depth, transfer learning, and high-disagreement examples affect model performance. By introducing MHIST, we hope to not only help facilitate the work of current histopathology imaging researchers, but also make the field more accessible to the general community. Our dataset is available at https://bmirds.github.io/MHIST.
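
The labeling scheme described in this abstract (a majority vote over seven annotators plus an agreement level) might be sketched as follows. This is an illustration only: the function name `majority_label`, the placeholder class names, and the choice to report agreement as the number of votes for the winning class are assumptions, not the dataset's documented procedure.

```python
from collections import Counter


def majority_label(votes):
    """Return (label, agreement) for one image.

    `votes` is the list of labels assigned by the annotators. Agreement is
    taken here to be the number of annotators who chose the winning label
    (an assumption made for illustration).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    # most_common() sorts labels by descending vote count.
    (label, n_top), *rest = counts.most_common()
    if rest and rest[0][1] == n_top:
        raise ValueError("tie: no majority label")
    return label, n_top


# Seven hypothetical annotations for a single image.
votes = ["class_a"] * 5 + ["class_b"] * 2
print(majority_label(votes))
```

With an odd number of annotators and two classes, a tie cannot occur, so the tie check only matters if the sketch is reused with a different panel size or more classes.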

IV · Oct 30, 2020
Development and Evaluation of a Deep Neural Network for Histologic Classification of Renal Cell Carcinoma on Biopsy and Surgical Resection Slides

Mengdan Zhu, Bing Ren, Ryland Richards et al.

Renal cell carcinoma (RCC) is the most common renal cancer in adults. The histopathologic classification of RCC is essential for diagnosis, prognosis, and management of patients. Recognition and classification of complex histologic patterns of RCC on biopsy and surgical resection slides under a microscope remains a heavily specialized, error-prone, and time-consuming task for pathologists. In this study, we developed a deep neural network model that can accurately classify digitized surgical resection slides and biopsy slides into five related classes: clear cell RCC, papillary RCC, chromophobe RCC, renal oncocytoma, and normal. In addition to the whole-slide classification pipeline, we visualized the identified indicative regions and features on slides for classification by reprocessing patch-level classification results to ensure the explainability of our diagnostic model. We evaluated our model on independent test sets of 78 surgical resection whole slides and 79 biopsy slides from our tertiary medical institution, and 69 randomly selected surgical resection slides from The Cancer Genome Atlas (TCGA) database. The average area under the curve (AUC) of our classifier on the internal resection slides, internal biopsy slides, and external TCGA slides is 0.98, 0.98 and 0.99, respectively. Our results suggest high generalizability of our approach across different data sources and specimen types. More importantly, our model has the potential to assist pathologists by (1) automatically pre-screening slides to reduce false-negative cases, (2) highlighting regions of importance on digitized slides to accelerate diagnosis, and (3) providing objective and accurate diagnosis as a second opinion.

CV · Sep 29, 2020
Learn like a Pathologist: Curriculum Learning by Annotator Agreement for Histopathology Image Classification

Jerry Wei, Arief Suriawinata, Bing Ren et al.

Applying curriculum learning requires both a range of difficulty in data and a method for determining the difficulty of examples. In many tasks, however, satisfying these requirements can be a formidable challenge. In this paper, we contend that histopathology image classification is a compelling use case for curriculum learning. Based on the nature of histopathology images, a range of difficulty inherently exists among examples, and, since medical datasets are often labeled by multiple annotators, annotator agreement can be used as a natural proxy for the difficulty of a given example. Hence, we propose a simple curriculum learning method that trains on progressively harder images as determined by annotator agreement. We evaluate our hypothesis on the challenging and clinically important task of colorectal polyp classification. Whereas vanilla training achieves an AUC of 83.7% for this task, a model trained with our proposed curriculum learning approach achieves an AUC of 88.2%, an improvement of 4.5 percentage points. Our work aims to inspire researchers to think more creatively and rigorously when choosing contexts for applying curriculum learning.
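
One way to read "training on progressively harder images as determined by annotator agreement" is as a sequence of cumulative training subsets, sketched below. The function name `curriculum_stages`, the vote-count thresholds, and the cumulative schedule are all assumptions for illustration; the abstract does not specify the schedule the authors used.

```python
def curriculum_stages(examples, min_votes_per_stage):
    """Build cumulative training subsets ordered from easy to hard.

    `examples` is a list of (example_id, votes_for_label) pairs, where
    votes_for_label counts the annotators who agree with the final label.
    Higher agreement is treated as easier. Each stage keeps every example
    whose agreement meets the stage's threshold, so later stages add harder
    images to the earlier ones.
    """
    thresholds = sorted(min_votes_per_stage, reverse=True)
    return [
        [ex_id for ex_id, n_votes in examples if n_votes >= t]
        for t in thresholds
    ]


# Hypothetical images annotated by seven pathologists each.
examples = [("img1", 7), ("img2", 4), ("img3", 6), ("img4", 5)]
for stage, ids in enumerate(curriculum_stages(examples, [7, 6, 4]), start=1):
    print(f"stage {stage}: train on {ids}")
```

A training loop would then run a few epochs on each stage in order, ending with the full dataset.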

CV · Apr 27, 2020
Difficulty Translation in Histopathology Images

Jerry Wei, Arief Suriawinata, Xiaoying Liu et al.

The unique nature of histopathology images opens the door to domain-specific formulations of image translation models. We propose a difficulty translation model that modifies colorectal histopathology images to be more challenging to classify. Our model comprises a scorer, which provides an output confidence to measure the difficulty of images, and an image translator, which learns to translate images from easy-to-classify to hard-to-classify using a training set defined by the scorer. We present three findings. First, generated images were indeed harder to classify for both human pathologists and machine learning classifiers than their corresponding source images. Second, image classifiers trained with generated images as augmented data performed better on both easy and hard images from an independent test set. Finally, human annotator agreement and our model's measure of difficulty correlated strongly, implying that for future work requiring human annotator agreement, the confidence score of a machine learning classifier could be used as a proxy.

LG · Nov 22, 2019
Parallel Distributed Logistic Regression for Vertical Federated Learning without Third-Party Coordinator

Shengwen Yang, Bing Ren, Xuhui Zhou et al.

Federated Learning is a new distributed learning mechanism which allows model training on a large corpus of decentralized data owned by different data providers, without sharing or leakage of raw data. According to the characteristics of data distribution, it can usually be classified into three categories: horizontal federated learning, vertical federated learning, and federated transfer learning. In this paper we present a solution for parallel distributed logistic regression for vertical federated learning. Compared with existing works, the role of the third-party coordinator is removed in our proposed solution. The system is built on the parameter server architecture and aims to speed up model training by utilizing a cluster of servers when the volume of training data is large. We also evaluate the performance of the parallel distributed model training, and the experimental results show the great scalability of the system.
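
In vertical federated learning, each party holds different feature columns for the same samples, so a logistic regression score decomposes into a sum of per-party partial scores. The sketch below shows only that decomposition in plain Python; the function names and the two-party setup are hypothetical, and it deliberately omits the privacy-preserving exchange and the parameter server machinery that the paper's system is built around.

```python
import math


def partial_score(weights, features):
    """Dot product a single party computes on its own feature columns."""
    return sum(w * x for w, x in zip(weights, features))


def predict(partial_scores, bias=0.0):
    """Combine per-party partial scores into a logistic regression probability."""
    z = sum(partial_scores) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Two hypothetical parties hold different feature columns for the same sample.
party_a = partial_score([0.5, -1.0], [2.0, 1.0])
party_b = partial_score([0.25], [4.0])
print(predict([party_a, party_b]))
```

The combined result equals what a single party would compute if it held all three features and weights, which is why only the partial scores, and not the raw features, need to be combined.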

IV · Oct 13, 2019
Generative Image Translation for Data Augmentation in Colorectal Histopathology Images

Jerry Wei, Arief Suriawinata, Louis Vaickus et al.

We present an image translation approach to generate augmented data for mitigating data imbalances in a dataset of histopathology images of colorectal polyps, adenomatous tumors that can lead to colorectal cancer if left untreated. By applying cycle-consistent generative adversarial networks (CycleGANs) to a source domain of normal colonic mucosa images, we generate synthetic colorectal polyp images that belong to diagnostically less common polyp classes. Generated images maintain the general structure of their source image but exhibit adenomatous features that can be enhanced with our proposed filtration module, called Path-Rank-Filter. We evaluate the quality of generated images through Turing tests with four gastrointestinal pathologists, finding that at least two of the four pathologists could not identify generated images at a statistically significant level. Finally, we demonstrate that using CycleGAN-generated images to augment training data improves the AUC of a convolutional neural network for detecting sessile serrated adenomas by over 10%, suggesting that our approach might warrant further research for other histopathology image classification tasks.

IV · Sep 27, 2019
Deep neural networks for automated classification of colorectal polyps on histopathology slides: A multi-institutional evaluation

Jason W. Wei, Arief A. Suriawinata, Louis J. Vaickus et al.

Histological classification of colorectal polyps plays a critical role in both screening for colorectal cancer and care of affected patients. An accurate and automated algorithm for the classification of colorectal polyps on digitized histopathology slides could benefit clinicians and patients. Our objective was to evaluate the performance and assess the generalizability of a deep neural network for colorectal polyp classification on histopathology slide images using a multi-institutional dataset. In this study, we developed a deep neural network for classification of four major colorectal polyp types (tubular adenoma, tubulovillous/villous adenoma, hyperplastic polyp, and sessile serrated adenoma), based on digitized histopathology slides from our institution, Dartmouth-Hitchcock Medical Center (DHMC), in New Hampshire. We evaluated the deep neural network on an internal dataset of 157 histopathology slide images from DHMC, as well as on an external dataset of 238 histopathology slide images from 24 different institutions spanning 13 states in the United States. We measured accuracy, sensitivity, and specificity of our model in this evaluation and compared its performance to local pathologists' diagnoses at the point-of-care retrieved from corresponding pathology laboratories. For the internal evaluation, the deep neural network had a mean accuracy of 93.5% (95% CI 89.6%-97.4%), compared with local pathologists' accuracy of 91.4% (95% CI 87.0%-95.8%). On the external test set, the deep neural network achieved an accuracy of 87.0% (95% CI 82.7%-91.3%), comparable with local pathologists' accuracy of 86.6% (95% CI 82.3%-90.9%). If confirmed in clinical settings, our model could assist pathologists by improving the diagnostic efficiency, reproducibility, and accuracy of colorectal cancer screenings.

CV · Jan 31, 2019
Automated detection of celiac disease on duodenal biopsy slides: a deep learning approach

Jason W. Wei, Jerry W. Wei, Christopher R. Jackson et al.

Celiac disease prevalence and diagnosis have increased substantially in recent years. The current gold standard for celiac disease confirmation is visual examination of duodenal mucosal biopsies. An accurate computer-aided biopsy analysis system using deep learning can help pathologists diagnose celiac disease more efficiently. In this study, we trained a deep learning model to detect celiac disease on duodenal biopsy images. Our model uses a state-of-the-art residual convolutional neural network to evaluate patches of duodenal tissue and then aggregates those predictions for whole-slide classification. We tested the model on an independent set of 212 images and evaluated its classification results against reference standards established by pathologists. Our model identified celiac disease, normal tissue, and nonspecific duodenitis with accuracies of 95.3%, 91.0%, and 89.2%, respectively. The area under the receiver operating characteristic curve was greater than 0.95 for all classes. We have developed an automated biopsy analysis system that achieves high performance in detecting celiac disease on biopsy slides. Our system can highlight areas of interest and provide preliminary classification of duodenal biopsies prior to review by pathologists. This technology has great potential for improving the accuracy and efficiency of celiac disease diagnosis.
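
The patch-then-aggregate pipeline mentioned in this abstract could look roughly like the sketch below. The abstract does not say how patch predictions are combined, so averaging class probabilities is an assumption chosen for simplicity, and `classify_slide` and the example probabilities are hypothetical.

```python
def classify_slide(patch_probs):
    """Aggregate patch-level class probabilities into a slide-level label.

    `patch_probs` is a list of dicts mapping class name -> probability, one
    dict per patch. The slide score for a class is the mean of its patch
    probabilities, and the predicted label is the class with the top score.
    """
    if not patch_probs:
        raise ValueError("slide has no patches")
    classes = patch_probs[0].keys()
    scores = {
        c: sum(p[c] for p in patch_probs) / len(patch_probs) for c in classes
    }
    return max(scores, key=scores.get), scores


# Hypothetical outputs of a patch classifier for three patches of one slide.
patches = [
    {"celiac": 0.8, "normal": 0.1, "duodenitis": 0.1},
    {"celiac": 0.6, "normal": 0.3, "duodenitis": 0.1},
    {"celiac": 0.2, "normal": 0.7, "duodenitis": 0.1},
]
label, scores = classify_slide(patches)
print(label, scores)
```

Other aggregation rules, such as a majority vote over patch labels or a thresholded fraction of positive patches, would slot into the same function.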

IV · Nov 20, 2018
Attention-Based Deep Neural Networks for Detection of Cancerous and Precancerous Esophagus Tissue on Histopathological Slides

Naofumi Tomita, Behnaz Abdollahi, Jason Wei et al.

Deep learning-based methods, such as the sliding window approach for cropped-image classification and heuristic aggregation for whole-slide inference, for analyzing histological patterns in high-resolution microscopy images have shown promising results. These approaches, however, require a laborious annotation process and are fragmented. This diagnostic study collected deidentified high-resolution histological images (N = 379) for training a new model composed of a convolutional neural network and a grid-based attention network, trainable without region-of-interest annotations. Histological images of patients who underwent endoscopic esophagus and gastroesophageal junction mucosal biopsy between January 1, 2016, and December 31, 2018, at Dartmouth-Hitchcock Medical Center (Lebanon, New Hampshire) were collected. The method achieved a mean accuracy of 0.83 in classifying 123 test images. These results were comparable with or better than the performance from the current state-of-the-art sliding window approach, which was trained with regions of interest. Results of this study suggest that the proposed attention-based deep neural network framework for Barrett esophagus and esophageal adenocarcinoma detection is important because it is based solely on tissue-level annotations, unlike existing methods that are based on regions of interest. This new model is expected to open avenues for applying deep learning to digital pathology.