Cynthia Zeng

h-index6

7papers

476citations

Novelty49%

AI Score29

Ranked #151,726 of 201,326 authors (top 75%)#33,399 in LG (top 78%)

7 Papers

LGJan 29, 2023

Global Flood Prediction: a Multimodal Machine Learning Approach

Cynthia Zeng, Dimitris Bertsimas

Flooding is one of the most destructive and costly natural disasters, and climate changes would further increase risks globally. This work presents a novel multimodal machine learning approach for multi-year global flood risk prediction, combining geographical information and historical natural disaster dataset. Our multimodal framework employs state-of-the-art processing techniques to extract embeddings from each data modality, including text-based geographical data and tabular-based time-series data. Experiments demonstrate that a multimodal approach, that is combining text and statistical data, outperforms a single-modality approach. Our most advanced architecture, employing embeddings extracted using transfer learning upon DistilBert model, achieves 75\%-77\% ROCAUC score in predicting the next 1-5 year flooding event in historically flooded locations. This work demonstrates the potentials of using machine learning for long-term planning in natural disaster management.

LGJun 21, 2022

TabText: Language-Based Representations of Tabular Health Data for Predictive Modelling

Kimberly Villalobos Carballo, Liangyuan Na, Yu Ma et al.

Tabular medical records remain the most readily available data format for applying machine learning in healthcare. However, traditional data preprocessing ignores valuable contextual information in tables and requires substantial manual cleaning and harmonisation, creating a bottleneck for model development. We introduce TabText, a preprocessing and feature extraction method that leverages contextual information and streamlines the curation of tabular medical data. This method converts tables into contextual language and applies pretrained large language models (LLMs) to generate task-independent numerical representations. These fixed embeddings are then used as input for various predictive tasks. TabText was evaluated on nine inpatient flow prediction tasks (e.g., ICU admission, discharge, mortality) using electronic medical records across six hospitals from a US health system, and on nine publicly available datasets from the UCI Machine Learning Repository, covering tasks such as cancer diagnosis, recurrence, and survival. TabText models trained on unprocessed data from a single hospital (572,964 patient-days, Jan 2018-Dec 2020) achieved accurate performance (AUC 0.75-0.94) when tested prospectively on 265,917 patient-days from Jan 2021-Apr 2022, and generalised well to five additional hospitals not used for training. When augmenting preprocessed tabular records with these contextual embeddings, out-of-sample AUC improved by up to 4 additive percentage points in challenging tasks such as ICU transfer and breast cancer recurrence, while providing little to no benefit for already high-performing tasks. Findings were consistent across both private and public datasets.

LGMar 22, 2023

Reducing Air Pollution through Machine Learning

Dimitris Bertsimas, Leonard Boussioux, Cynthia Zeng

This paper presents a data-driven approach to mitigate the effects of air pollution from industrial plants on nearby cities by linking operational decisions with weather conditions. Our method combines predictive and prescriptive machine learning models to forecast short-term wind speed and direction and recommend operational decisions to reduce or pause the industrial plant's production. We exhibit several trade-offs between reducing environmental impact and maintaining production activities. The predictive component of our framework employs various machine learning models, such as gradient-boosted tree-based models and ensemble methods, for time series forecasting. The prescriptive component utilizes interpretable optimal policy trees to propose multiple trade-offs, such as reducing dangerous emissions by 33-47% and unnecessary costs by 40-63%. Our deployed models significantly reduced forecasting errors, with a range of 38-52% for less than 12-hour lead time and 14-46% for 12 to 48-hour lead time compared to official weather forecasts. We have successfully implemented the predictive component at the OCP Safi site, which is Morocco's largest chemical industrial plant, and are currently in the process of deploying the prescriptive component. Our framework enables sustainable industrial development by eliminating the pollution-industrial activity trade-off through data-driven weather-based operational decisions, significantly enhancing factory optimization and sustainability. This modernizes factory planning and resource allocation while maintaining environmental compliance. The predictive component has boosted production efficiency, leading to cost savings and reduced environmental impact by minimizing air pollution.

OCMay 11, 2024

Catastrophe Insurance: An Adaptive Robust Optimization Approach

Dimitris Bertsimas, Cynthia Zeng

The escalating frequency and severity of natural disasters, exacerbated by climate change, underscore the critical role of insurance in facilitating recovery and promoting investments in risk reduction. This work introduces a novel Adaptive Robust Optimization (ARO) framework tailored for the calculation of catastrophe insurance premiums, with a case study applied to the United States National Flood Insurance Program (NFIP). To the best of our knowledge, it is the first time an ARO approach has been applied to for disaster insurance pricing. Our methodology is designed to protect against both historical and emerging risks, the latter predicted by machine learning models, thus directly incorporating amplified risks induced by climate change. Using the US flood insurance data as a case study, optimization models demonstrate effectiveness in covering losses and produce surpluses, with a smooth balance transition through parameter fine-tuning. Among tested optimization models, results show ARO models with conservative parameter values achieving low number of insolvent states with the least insurance premium charged. Overall, optimization frameworks offer versatility and generalizability, making it adaptable to a variety of natural disaster scenarios, such as wildfires, droughts, etc. This work not only advances the field of insurance premium modeling but also serves as a vital tool for policymakers and stakeholders in building resilience to the growing risks of natural catastrophes.

LGFeb 25, 2022

Integrated multimodal artificial intelligence framework for healthcare applications

Luis R. Soenksen, Yu Ma, Cynthia Zeng et al.

Artificial intelligence (AI) systems hold great promise to improve healthcare over the next decades. Specifically, AI systems leveraging multiple data sources and input modalities are poised to become a viable method to deliver more accurate results and deployable pipelines across a wide range of applications. In this work, we propose and evaluate a unified Holistic AI in Medicine (HAIM) framework to facilitate the generation and testing of AI systems that leverage multimodal inputs. Our approach uses generalizable data pre-processing and machine learning modeling stages that can be readily adapted for research and deployment in healthcare environments. We evaluate our HAIM framework by training and characterizing 14,324 independent models based on HAIM-MIMIC-MM, a multimodal clinical database (N=34,537 samples) containing 7,279 unique hospitalizations and 6,485 patients, spanning all possible input combinations of 4 data modalities (i.e., tabular, time-series, text, and images), 11 unique data sources and 12 predictive tasks. We show that this framework can consistently and robustly produce models that outperform similar single-source approaches across various healthcare demonstrations (by 6-33%), including 10 distinct chest pathology diagnoses, along with length-of-stay and 48-hour mortality predictions. We also quantify the contribution of each modality and data source using Shapley values, which demonstrates the heterogeneity in data modality importance and the necessity of multimodal inputs across different healthcare-relevant tasks. The generalizable properties and flexibility of our Holistic AI in Medicine (HAIM) framework could offer a promising pathway for future multimodal predictive systems in clinical and operational healthcare settings.

LGNov 11, 2020

Hurricane Forecasting: A Novel Multimodal Machine Learning Framework

Léonard Boussioux, Cynthia Zeng, Théo Guénais et al.

This paper describes a novel machine learning (ML) framework for tropical cyclone intensity and track forecasting, combining multiple ML techniques and utilizing diverse data sources. Our multimodal framework, called Hurricast, efficiently combines spatial-temporal data with statistical data by extracting features with deep-learning encoder-decoder architectures and predicting with gradient-boosted trees. We evaluate our models in the North Atlantic and Eastern Pacific basins on 2016-2019 for 24-hour lead time track and intensity forecasts and show they achieve comparable mean absolute error and skill to current operational forecast models while computing in seconds. Furthermore, the inclusion of Hurricast into an operational forecast consensus model could improve over the National Hurricane Center's official forecast, thus highlighting the complementary properties with existing approaches. In summary, our work demonstrates that utilizing machine learning techniques to combine different data sources can lead to new opportunities in tropical cyclone forecasting.

APJun 30, 2020

From predictions to prescriptions: A data-driven response to COVID-19

Dimitris Bertsimas, Léonard Boussioux, Ryan Cory Wright et al.

The COVID-19 pandemic has created unprecedented challenges worldwide. Strained healthcare providers make difficult decisions on patient triage, treatment and care management on a daily basis. Policy makers have imposed social distancing measures to slow the disease, at a steep economic price. We design analytical tools to support these decisions and combat the pandemic. Specifically, we propose a comprehensive data-driven approach to understand the clinical characteristics of COVID-19, predict its mortality, forecast its evolution, and ultimately alleviate its impact. By leveraging cohort-level clinical data, patient-level hospital data, and census-level epidemiological data, we develop an integrated four-step approach, combining descriptive, predictive and prescriptive analytics. First, we aggregate hundreds of clinical studies into the most comprehensive database on COVID-19 to paint a new macroscopic picture of the disease. Second, we build personalized calculators to predict the risk of infection and mortality as a function of demographics, symptoms, comorbidities, and lab values. Third, we develop a novel epidemiological model to project the pandemic's spread and inform social distancing policies. Fourth, we propose an optimization model to re-allocate ventilators and alleviate shortages. Our results have been used at the clinical level by several hospitals to triage patients, guide care management, plan ICU capacity, and re-distribute ventilators. At the policy level, they are currently supporting safe back-to-work policies at a major institution and equitable vaccine distribution planning at a major pharmaceutical company, and have been integrated into the US Center for Disease Control's pandemic forecast.