Lily Peng

CV
18papers
3,453citations
Novelty46%
AI Score43

18 Papers

CVMay 19, 2022
Robust and Efficient Medical Imaging with Self-Supervision

Shekoofeh Azizi, Laura Culp, Jan Freyberg et al.

Recent progress in Medical Artificial Intelligence (AI) has delivered systems that can reach clinical expert level performance. However, such systems tend to demonstrate sub-optimal "out-of-distribution" performance when evaluated in clinical settings different from the training environment. A common mitigation strategy is to develop separate systems for each clinical setting using site-specific data [1]. However, this quickly becomes impractical as medical data is time-consuming to acquire and expensive to annotate [2]. Thus, the problem of "data-efficient generalization" presents an ongoing difficulty for Medical AI development. Although progress in representation learning shows promise, their benefits have not been rigorously studied, specifically for out-of-distribution settings. To meet these challenges, we present REMEDIS, a unified representation learning strategy to improve robustness and data-efficiency of medical imaging AI. REMEDIS uses a generic combination of large-scale supervised transfer learning with self-supervised learning and requires little task-specific customization. We study a diverse range of medical imaging tasks and simulate three realistic application scenarios using retrospective data. REMEDIS exhibits significantly improved in-distribution performance with up to 11.5% relative improvement in diagnostic accuracy over a strong supervised baseline. More importantly, our strategy leads to strong data-efficient generalization of medical imaging AI, matching strong supervised baselines using between 1% to 33% of retraining data across tasks. These results suggest that REMEDIS can significantly accelerate the life-cycle of medical imaging AI development thereby presenting an important step forward for medical imaging AI to deliver broad impact.

LGMar 18, 2022
AI system for fetal ultrasound in low-resource settings

Ryan G. Gomes, Bellington Vwalika, Chace Lee et al.

Despite considerable progress in maternal healthcare, maternal and perinatal deaths remain high in low-to-middle income countries. Fetal ultrasound is an important component of antenatal care, but shortage of adequately trained healthcare workers has limited its adoption. We developed and validated an artificial intelligence (AI) system that uses novice-acquired "blind sweep" ultrasound videos to estimate gestational age (GA) and fetal malpresentation. We further addressed obstacles that may be encountered in low-resourced settings. Using a simplified sweep protocol with real-time AI feedback on sweep quality, we have demonstrated the generalization of model performance to minimally trained novice ultrasound operators using low cost ultrasound devices with on-device AI integration. The GA model was non-inferior to standard fetal biometry estimates with as few as two sweeps, and the fetal malpresentation model had high AUC-ROCs across operators and devices. Our AI models have the potential to assist in upleveling the capabilities of lightly trained ultrasound operators in low resource settings.

IVJul 19, 2022
Discovering novel systemic biomarkers in photos of the external eye

Boris Babenko, Ilana Traynis, Christina Chen et al.

External eye photos were recently shown to reveal signs of diabetic retinal disease and elevated HbA1c. In this paper, we evaluate if external eye photos contain information about additional systemic medical conditions. We developed a deep learning system (DLS) that takes external eye photos as input and predicts multiple systemic parameters, such as those related to the liver (albumin, AST); kidney (eGFR estimated using the race-free 2021 CKD-EPI creatinine equation, the urine ACR); bone & mineral (calcium); thyroid (TSH); and blood count (Hgb, WBC, platelets). Development leveraged 151,237 images from 49,015 patients with diabetes undergoing diabetic eye screening in 11 sites across Los Angeles county, CA. Evaluation focused on 9 pre-specified systemic parameters and leveraged 3 validation sets (A, B, C) spanning 28,869 patients with and without diabetes undergoing eye screening in 3 independent sites in Los Angeles County, CA, and the greater Atlanta area, GA. We compared against baseline models incorporating available clinicodemographic variables (e.g. age, sex, race/ethnicity, years with diabetes). Relative to the baseline, the DLS achieved statistically significant superior performance at detecting AST>36, calcium<8.6, eGFR<60, Hgb<11, platelets<150, ACR>=300, and WBC<4 on validation set A (a patient population similar to the development sets), where the AUC of DLS exceeded that of the baseline by 5.2-19.4%. On validation sets B and C, with substantial patient population differences compared to the development sets, the DLS outperformed the baseline for ACR>=300 and Hgb<11 by 7.3-13.2%. Our findings provide further evidence that external eye photos contain important biomarkers of systemic health spanning multiple organ systems. Further work is needed to investigate whether and how these biomarkers can be translated into clinical impact.

47.9AIMar 18
The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition

Alvin Rajkomar, Pavan Sudarshan, Angela Lai et al.

Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs) rarely characterize the "patient" or "query" populations they contain. Without defined composition, aggregate performance metrics may misrepresent model readiness for clinical use. Methods: We analyzed 18,707 consumer health queries across six public benchmarks using LLMs as automated coding instruments to apply a standardized 16-field taxonomy profiling context, topic, and intent. Results: We identified a structural "validity gap." While benchmarks have evolved from static retrieval to interactive dialogue, clinical composition remains misaligned with real-world needs. Although 42% of the corpus referenced objective data, this was polarized toward wellness-focused wearable signals (17.7%); complex diagnostic inputs remained rare, including laboratory values (5.2%), imaging (3.8%), and raw medical records (0.6%). Safety-critical scenarios were effectively absent: suicide/self-harm queries comprised <0.7% of the corpus and chronic disease management only 5.5%. Benchmarks also neglected vulnerable populations (pediatrics/older adults <11%) and global health needs. Conclusions: Evaluation benchmarks remain misaligned with real-world clinical needs, lacking raw clinical artifacts, adequate representation of vulnerable populations, and longitudinal chronic care scenarios. The field must adopt standardized query profiling--analogous to clinical trial reporting--to align evaluation with the full complexity of clinical practice.

IVMay 16, 2021
Deep learning for detecting pulmonary tuberculosis via chest radiography: an international study across 10 countries

Sahar Kazemzadeh, Jin Yu, Shahar Jamshy et al.

Tuberculosis (TB) is a top-10 cause of death worldwide. Though the WHO recommends chest radiographs (CXRs) for TB screening, the limited availability of CXR interpretation is a barrier. We trained a deep learning system (DLS) to detect active pulmonary TB using CXRs from 9 countries across Africa, Asia, and Europe, and utilized large-scale CXR pretraining, attention pooling, and noisy student semi-supervised learning. Evaluation was on (1) a combined test set spanning China, India, US, and Zambia, and (2) an independent mining population in South Africa. Given WHO targets of 90% sensitivity and 70% specificity, the DLS's operating point was prespecified to favor sensitivity over specificity. On the combined test set, the DLS's ROC curve was above all 9 India-based radiologists, with an AUC of 0.90 (95%CI 0.87-0.92). The DLS's sensitivity (88%) was higher than the India-based radiologists (75% mean sensitivity), p<0.001 for superiority; and its specificity (79%) was non-inferior to the radiologists (84% mean specificity), p=0.004. Similar trends were observed within HIV positive and sputum smear positive sub-groups, and in the South Africa test set. We found that 5 US-based radiologists (where TB isn't endemic) were more sensitive and less specific than the India-based radiologists (where TB is endemic). The DLS also remained non-inferior to the US-based radiologists. In simulations, using the DLS as a prioritization tool for confirmatory testing reduced the cost per positive case detected by 40-80% compared to using confirmatory testing alone. To conclude, our DLS generalized to 5 countries, and merits prospective evaluation to assist cost-effective screening efforts in radiologist-limited settings. Operating point flexibility may permit customization of the DLS to account for site-specific factors such as TB prevalence, demographics, clinical resources, and customary practice patterns.

IVNov 23, 2020
Detecting hidden signs of diabetes in external eye photographs

Boris Babenko, Akinori Mitani, Ilana Traynis et al.

Diabetes-related retinal conditions can be detected by examining the posterior of the eye. By contrast, examining the anterior of the eye can reveal conditions affecting the front of the eye, such as changes to the eyelids, cornea, or crystalline lens. In this work, we studied whether external photographs of the front of the eye can reveal insights into both diabetic retinal diseases and blood glucose control. We developed a deep learning system (DLS) using external eye photographs of 145,832 patients with diabetes from 301 diabetic retinopathy (DR) screening sites in one US state, and evaluated the DLS on three validation sets containing images from 198 sites in 18 other US states. In validation set A (n=27,415 patients, all undilated), the DLS detected poor blood glucose control (HbA1c > 9%) with an area under receiver operating characteristic curve (AUC) of 70.2; moderate-or-worse DR with an AUC of 75.3; diabetic macular edema with an AUC of 78.0; and vision-threatening DR with an AUC of 79.4. For all 4 prediction tasks, the DLS's AUC was higher (p<0.001) than using available self-reported baseline characteristics (age, sex, race/ethnicity, years with diabetes). In terms of positive predictive value, the predicted top 5% of patients had a 67% chance of having HbA1c > 9%, and a 20% chance of having vision threatening diabetic retinopathy. The results generalized to dilated pupils (validation set B, 5,058 patients) and to a different screening service (validation set C, 10,402 patients). Our results indicate that external eye photographs contain information useful for healthcare providers managing patients with diabetes, and may help prioritize patients for in-person screening. Further work is needed to validate these findings on different devices and patient populations (those without diabetes) to evaluate its utility for remote diagnosis and management.

IVOct 22, 2020
Deep Learning for Distinguishing Normal versus Abnormal Chest Radiographs and Generalization to Unseen Diseases

Zaid Nabulsi, Andrew Sellergren, Shahar Jamshy et al.

Chest radiography (CXR) is the most widely-used thoracic clinical imaging modality and is crucial for guiding the management of cardiothoracic conditions. The detection of specific CXR findings has been the main focus of several artificial intelligence (AI) systems. However, the wide range of possible CXR abnormalities makes it impractical to build specific systems to detect every possible condition. In this work, we developed and evaluated an AI system to classify CXRs as normal or abnormal. For development, we used a de-identified dataset of 248,445 patients from a multi-city hospital network in India. To assess generalizability, we evaluated our system using 6 international datasets from India, China, and the United States. Of these datasets, 4 focused on diseases that the AI was not trained to detect: 2 datasets with tuberculosis and 2 datasets with coronavirus disease 2019. Our results suggest that the AI system generalizes to new patient populations and abnormalities. In a simulated workflow where the AI system prioritized abnormal cases, the turnaround time for abnormal cases reduced by 7-28%. These results represent an important step towards evaluating whether AI can be safely used to flag cases in a general setting where previously unseen abnormalities exist.

IVAug 10, 2020
Predicting Risk of Developing Diabetic Retinopathy using Deep Learning

Ashish Bora, Siva Balasubramanian, Boris Babenko et al.

Diabetic retinopathy (DR) screening is instrumental in preventing blindness, but faces a scaling challenge as the number of diabetic patients rises. Risk stratification for the development of DR may help optimize screening intervals to reduce costs while improving vision-related outcomes. We created and validated two versions of a deep learning system (DLS) to predict the development of mild-or-worse ("Mild+") DR in diabetic patients undergoing DR screening. The two versions used either three-fields or a single field of color fundus photographs (CFPs) as input. The training set was derived from 575,431 eyes, of which 28,899 had known 2-year outcome, and the remaining were used to augment the training process via multi-task learning. Validation was performed on both an internal validation set (set A; 7,976 eyes; 3,678 with known outcome) and an external validation set (set B; 4,762 eyes; 2,345 with known outcome). For predicting 2-year development of DR, the 3-field DLS had an area under the receiver operating characteristic curve (AUC) of 0.79 (95%CI, 0.78-0.81) on validation set A. On validation set B (which contained only a single field), the 1-field DLS's AUC was 0.70 (95%CI, 0.67-0.74). The DLS was prognostic even after adjusting for available risk factors (p<0.001). When added to the risk factors, the 3-field DLS improved the AUC from 0.72 (95%CI, 0.68-0.76) to 0.81 (95%CI, 0.77-0.84) in validation set A, and the 1-field DLS improved the AUC from 0.62 (95%CI, 0.58-0.66) to 0.71 (95%CI, 0.68-0.75) in validation set B. The DLSs in this study identified prognostic information for DR development from CFPs. This information is independent of and more informative than the available risk factors.

CVJul 10, 2020
Scientific Discovery by Generating Counterfactuals using Image Translation

Arunachalam Narayanaswamy, Subhashini Venugopalan, Dale R. Webster et al.

Model explanation techniques play a critical role in understanding the source of a model's performance and making its decisions transparent. Here we investigate if explanation techniques can also be used as a mechanism for scientific discovery. We make three contributions: first, we propose a framework to convert predictions from explanation techniques to a mechanism of discovery. Second, we show how generative models in combination with black-box predictors can be used to generate hypotheses (without human priors) that can be critically examined. Third, with these techniques we study classification models for retinal images predicting Diabetic Macular Edema (DME), where recent work showed that a CNN trained on these images is likely learning novel features in the image. We demonstrate that the proposed framework is able to explain the underlying scientific mechanism, thus bridging the gap between the model's performance and human understanding.

CVApr 12, 2019
Detecting Anemia from Retinal Fundus Images

Akinori Mitani, Yun Liu, Abigail Huang et al.

Despite its high prevalence, anemia is often undetected due to the invasiveness and cost of screening and diagnostic tests. Though some non-invasive approaches have been developed, they are less accurate than invasive methods, resulting in an unmet need for more accurate non-invasive methods. Here, we show that deep learning-based algorithms can detect anemia and quantify several related blood measurements using retinal fundus images both in isolation and in combination with basic metadata such as patient demographics. On a validation dataset of 11,388 patients from the UK Biobank, our algorithms achieved a mean absolute error of 0.63 g/dL (95% confidence interval (CI) 0.62-0.64) in quantifying hemoglobin concentration and an area under receiver operating characteristic curve (AUC) of 0.88 (95% CI 0.86-0.89) in detecting anemia. This work shows the potential of automated non-invasive anemia screening based on fundus images, particularly in diabetic patients, who may have regular retinal imaging and are at increased risk of further morbidity and mortality from anemia.

CVApr 10, 2019
Predicting Progression of Age-related Macular Degeneration from Fundus Images using Deep Learning

Boris Babenko, Siva Balasubramanian, Katy E. Blumer et al.

Background: Patients with neovascular age-related macular degeneration (AMD) can avoid vision loss via certain therapy. However, methods to predict the progression to neovascular age-related macular degeneration (nvAMD) are lacking. Purpose: To develop and validate a deep learning (DL) algorithm to predict 1-year progression of eyes with no, early, or intermediate AMD to nvAMD, using color fundus photographs (CFP). Design: Development and validation of a DL algorithm. Methods: We trained a DL algorithm to predict 1-year progression to nvAMD, and used 10-fold cross-validation to evaluate this approach on two groups of eyes in the Age-Related Eye Disease Study (AREDS): none/early/intermediate AMD, and intermediate AMD (iAMD) only. We compared the DL algorithm to the manually graded 4-category and 9-step scales in the AREDS dataset. Main outcome measures: Performance of the DL algorithm was evaluated using the sensitivity at 80% specificity for progression to nvAMD. Results: The DL algorithm's sensitivity for predicting progression to nvAMD from none/early/iAMD (78+/-6%) was higher than manual grades from the 9-step scale (67+/-8%) or the 4-category scale (48+/-3%). For predicting progression specifically from iAMD, the DL algorithm's sensitivity (57+/-6%) was also higher compared to the 9-step grades (36+/-8%) and the 4-category grades (20+/-0%). Conclusions: Our DL algorithm performed better in predicting progression to nvAMD than manual grades. Future investigations are required to test the application of this DL algorithm in a real-world clinical setting.

CVDec 21, 2018
Deep Learning and Glaucoma Specialists: The Relative Importance of Optic Disc Features to Predict Glaucoma Referral in Fundus Photos

Sonia Phene, R. Carter Dunn, Naama Hammel et al.

Glaucoma is the leading cause of preventable, irreversible blindness world-wide. The disease can remain asymptomatic until severe, and an estimated 50%-90% of people with glaucoma remain undiagnosed. Glaucoma screening is recommended for early detection and treatment. A cost-effective tool to detect glaucoma could expand screening access to a much larger patient population, but such a tool is currently unavailable. We trained a deep learning algorithm using a retrospective dataset of 86,618 images, assessed for glaucomatous optic nerve head features and referable glaucomatous optic neuropathy (GON). The algorithm was validated using 3 datasets. For referable GON, the algorithm had an AUC of 0.945 (95% CI, 0.929-0.960) in dataset A (1205 images, 1 image/patient; 18.1% referable), images adjudicated by panels of Glaucoma Specialists (GSs); 0.855 (95% CI, 0.841-0.870) in dataset B (9642 images, 1 image/patient; 9.2% referable), images from Atlanta Veterans Affairs Eye Clinic diabetic teleretinal screening program; and 0.881 (95% CI, 0.838-0.918) in dataset C (346 images, 1 image/patient; 81.7% referable), images from Dr. Shroff's Charity Eye Hospital's glaucoma clinic. The algorithm showed significantly higher sensitivity than 7 of 10 graders not involved in determining the reference standard, including 2 of 3 GSs, and showed higher specificity than 3 graders, while remaining comparable to others. For both GSs and the algorithm, the most crucial features related to referable GON were: presence of vertical cup-to-disc ratio of 0.7 or more, neuroretinal rim notching, retinal nerve fiber layer defect, and bared circumlinear vessels. An algorithm trained on fundus images alone can detect referable GON with higher sensitivity than and comparable specificity to eye care providers. The algorithm maintained good performance on an independent dataset with diagnoses based on a full glaucoma workup.

CVOct 18, 2018
Predicting optical coherence tomography-derived diabetic macular edema grades from fundus photographs using deep learning

Avinash Varadarajan, Pinal Bavishi, Paisan Raumviboonsuk et al.

Diabetic eye disease is one of the fastest growing causes of preventable blindness. With the advent of anti-VEGF (vascular endothelial growth factor) therapies, it has become increasingly important to detect center-involved diabetic macular edema (ci-DME). However, center-involved diabetic macular edema is diagnosed using optical coherence tomography (OCT), which is not generally available at screening sites because of cost and workflow constraints. Instead, screening programs rely on the detection of hard exudates in color fundus photographs as a proxy for DME, often resulting in high false positive or false negative calls. To improve the accuracy of DME screening, we trained a deep learning model to use color fundus photographs to predict ci-DME. Our model had an ROC-AUC of 0.89 (95% CI: 0.87-0.91), which corresponds to a sensitivity of 85% at a specificity of 80%. In comparison, three retinal specialists had similar sensitivities (82-85%), but only half the specificity (45-50%, p<0.001 for each comparison with model). The positive predictive value (PPV) of the model was 61% (95% CI: 56-66%), approximately double the 36-38% by the retinal specialists. In addition to predicting ci-DME, our model was able to detect the presence of intraretinal fluid with an AUC of 0.81 (95% CI: 0.81-0.86) and subretinal fluid with an AUC of 0.88 (95% CI: 0.85-0.91). The ability of deep learning algorithms to make clinically relevant predictions that generally require sophisticated 3D-imaging equipment from simple 2D images has broad relevance to many other applications in medical imaging.

CVOct 18, 2018
Deep Learning vs. Human Graders for Classifying Severity Levels of Diabetic Retinopathy in a Real-World Nationwide Screening Program

Paisan Raumviboonsuk, Jonathan Krause, Peranut Chotcomwongse et al.

Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population, and compare the algorithm performance with that of human graders. 25,326 gradable retinal images of patients with diabetes from the community-based, nation-wide screening program of DR in Thailand were analyzed for DR severity and referable diabetic macular edema (DME). Grades adjudicated by a panel of international retinal specialists served as the reference standard. Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate (by 23%) at the cost of slightly higher false positive rates (2%). Deep learning algorithms may serve as a valuable tool for DR screening.

CVDec 21, 2017
Deep learning for predicting refractive error from retinal fundus images

Avinash V. Varadarajan, Ryan Poplin, Katy Blumer et al.

Refractive error, one of the leading cause of visual impairment, can be corrected by simple interventions like prescribing eyeglasses. We trained a deep learning algorithm to predict refractive error from the fundus photographs from participants in the UK Biobank cohort, which were 45 degree field of view images and the AREDS clinical trial, which contained 30 degree field of view images. Our model use the "attention" method to identify features that are correlated with refractive error. Mean absolute error (MAE) of the algorithm's prediction compared to the refractive error obtained in the AREDS and UK Biobank. The resulting algorithm had a MAE of 0.56 diopters (95% CI: 0.55-0.56) for estimating spherical equivalent on the UK Biobank dataset and 0.91 diopters (95% CI: 0.89-0.92) for the AREDS dataset. The baseline expected MAE (obtained by simply predicting the mean of this population) was 1.81 diopters (95% CI: 1.79-1.84) for UK Biobank and 1.63 (95% CI: 1.60-1.67) for AREDS. Attention maps suggested that the foveal region was one of the most important areas used by the algorithm to make this prediction, though other regions also contribute to the prediction. The ability to estimate refractive error with high accuracy from retinal fundus photos has not been previously known and demonstrates that deep learning can be applied to make novel predictions from medical images. Given that several groups have recently shown that it is feasible to obtain retinal fundus photos using mobile phones and inexpensive attachments, this work may be particularly relevant in regions of the world where autorefractors may not be readily available.

CVOct 4, 2017
Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy

Jonathan Krause, Varun Gulshan, Ehsan Rahimy et al.

Diabetic retinopathy (DR) and diabetic macular edema are common complications of diabetes which can lead to vision loss. The grading of DR is a fairly complex process that requires the detection of fine features such as microaneurysms, intraretinal hemorrhages, and intraretinal microvascular abnormalities. Because of this, there can be a fair amount of grader variability. There are different methods of obtaining the reference standard and resolving disagreements between graders, and while it is usually accepted that adjudication until full consensus will yield the best reference standard, the difference between various methods of resolving disagreements has not been examined extensively. In this study, we examine the variability in different methods of grading, definitions of reference standards, and their effects on building deep learning models for the detection of diabetic eye disease. We find that a small set of adjudicated DR grades allows substantial improvements in algorithm performance. The resulting algorithm's performance was on par with that of individual U.S. board-certified ophthalmologists and retinal specialists.

CVAug 31, 2017
Predicting Cardiovascular Risk Factors from Retinal Fundus Photographs using Deep Learning

Ryan Poplin, Avinash V. Varadarajan, Katy Blumer et al.

Traditionally, medical discoveries are made by observing associations and then designing experiments to test these hypotheses. However, observing and quantifying associations in images can be difficult because of the wide variety of features, patterns, colors, values, shapes in real data. In this paper, we use deep learning, a machine learning technique that learns its own features, to discover new knowledge from retinal fundus images. Using models trained on data from 284,335 patients, and validated on two independent datasets of 12,026 and 999 patients, we predict cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as such as age (within 3.26 years), gender (0.97 AUC), smoking status (0.71 AUC), HbA1c (within 1.39%), systolic blood pressure (within 11.23mmHg) as well as major adverse cardiac events (0.70 AUC). We further show that our models used distinct aspects of the anatomy to generate each prediction, such as the optic disc or blood vessels, opening avenues of further research.

CVMar 3, 2017
Detecting Cancer Metastases on Gigapixel Pathology Images

Yun Liu, Krishna Gadepalli, Mohammad Norouzi et al.

Each year, the treatment decisions for more than 230,000 breast cancer patients in the U.S. hinge on whether the cancer has metastasized away from the breast. Metastasis detection is currently performed by pathologists reviewing large expanses of biological tissues. This process is labor intensive and error-prone. We present a framework to automatically detect and localize tumors as small as 100 x 100 pixels in gigapixel microscopy images sized 100,000 x 100,000 pixels. Our method leverages a convolutional neural network (CNN) architecture and obtains state-of-the-art results on the Camelyon16 dataset in the challenging lesion-level tumor detection task. At 8 false positives per image, we detect 92.4% of the tumors, relative to 82.7% by the previous best automated approach. For comparison, a human pathologist attempting exhaustive search achieved 73.2% sensitivity. We achieve image-level AUC scores above 97% on both the Camelyon16 test set and an independent set of 110 slides. In addition, we discover that two slides in the Camelyon16 training set were erroneously labeled normal. Our approach could considerably reduce false negative rates in metastasis detection.