Michael Lingzhi Li

h-index21

17papers

593citations

Novelty53%

AI Score40

Ranked #70,077 of 194,257 authors (top 36%)#15,734 in LG (top 39%)

17 Papers

15.9CLJun 5, 2023Code

Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

Junling Liu, Peilin Zhou, Yining Hua et al.

Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we further conducted thorough experiments with representative LLMs and QA algorithms on CMExam. The results show that GPT-4 had the best accuracy of 61.6% and a weighted F1 score of 0.617. These results highlight a great disparity when compared to human accuracy, which stood at 71.6%. For explanation tasks, while LLMs could generate relevant reasoning and demonstrate improved performance after finetuning, they fall short of a desired standard, indicating ample room for improvement. To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations. The experiments and findings of LLM evaluation also provide valuable insights into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines. The dataset and relevant code are available at https://github.com/williamliujl/CMExam.

4.3MEOct 15, 2022

Distributionally Robust Causal Inference with Observational Data

Dimitris Bertsimas, Kosuke Imai, Michael Lingzhi Li

We consider the estimation of average treatment effects in observational studies and propose a new framework of robust causal inference with unobserved confounders. Our approach is based on distributionally robust optimization and proceeds in two steps. We first specify the maximal degree to which the distribution of unobserved potential outcomes may deviate from that of observed outcomes. We then derive sharp bounds on the average treatment effects under this assumption. Our framework encompasses the popular marginal sensitivity model as a special case, and we demonstrate how the proposed methodology can address a primary challenge of the marginal sensitivity model that it produces uninformative results when unobserved confounders substantially affect treatment and outcome. Specifically, we develop an alternative sensitivity model, called the distributional sensitivity model, under the assumption that heterogeneity of treatment effect due to unobserved variables is relatively small. Unlike the marginal sensitivity model, the distributional sensitivity model allows for potential lack of overlap and often produces informative bounds even when unobserved variables substantially affect both treatment and outcome. Finally, we show how to extend the distributional sensitivity model to difference-in-differences designs and settings with instrumental variables. Through simulation and empirical studies, we demonstrate the applicability of the proposed methodology.

6.6LGAug 28, 2023

NAS-X: Neural Adaptive Smoothing via Twisting

Dieterich Lawson, Michael Li, Scott Linderman

Sequential latent variable models (SLVMs) are essential tools in statistics and machine learning, with applications ranging from healthcare to neuroscience. As their flexibility increases, analytic inference and model learning can become challenging, necessitating approximate methods. Here we introduce neural adaptive smoothing via twisting (NAS-X), a method that extends reweighted wake-sleep (RWS) to the sequential setting by using smoothing sequential Monte Carlo (SMC) to estimate intractable posterior expectations. Combining RWS and smoothing SMC allows NAS-X to provide low-bias and low-variance gradient estimates, and fit both discrete and continuous latent variable models. We illustrate the theoretical advantages of NAS-X over previous methods and explore these advantages empirically in a variety of tasks, including a challenging application to mechanistic models of neuronal dynamics. These experiments show that NAS-X substantially outperforms previous VI- and RWS-based methods in inference and model learning, achieving lower parameter error and tighter likelihood bounds.

6.4LGSep 17, 2024

Balancing Optimality and Diversity: Human-Centered Decision Making through Generative Curation

Michael Lingzhi Li, Shixiang Zhu

Operational decisions in healthcare, logistics, and public policy increasingly involve algorithms that recommend candidate solutions, such as treatment plans, delivery routes, or policy options, while leaving the final choice to human decision-makers. For instance, school districts use algorithms to design bus routes, but administrators make the final call given community feedback. In these settings, decision quality depends not on a single algorithmic ``optimum'', but on whether the portfolio of recommendations contains at least one option the human ultimately deems desirable. We propose generative curation, a framework that optimally generates recommendation sets when desirability depends on both observable objectives and unobserved qualitative considerations. Instead of a fixed solution, generative curation learns a distribution over solutions designed to maximize the expected desirability of the best option within a manageable portfolio. Our analysis identifies a trade-off between quantitative quality and qualitative diversity, formalized through a novel diversity metric derived from the reformulated objective. We implement the framework using a generative neural network and a sequential optimization method, and show in synthetic and real-world studies that it consistently reduces expected regret compared to existing benchmarks. Our framework provides decision-makers with a principled way to design algorithms that complement, rather than replace, human judgment. By generating portfolios of diverse yet high-quality options, decision-support tools can better accommodate unmodeled factors such as stakeholder preferences, political feasibility, or community acceptance. More broadly, the framework enables organizations to operationalize human-centered decision-making at scale, ensuring that algorithmic recommendations remain useful even when objectives are incomplete or evolving.

1.6LGOct 29, 2021Code

Holistic Deep Learning

Dimitris Bertsimas, Kimberly Villalobos Carballo, Léonard Boussioux et al.

This paper presents a novel holistic deep learning framework that simultaneously addresses the challenges of vulnerability to input perturbations, overparametrization, and performance instability from different train-validation splits. The proposed framework holistically improves accuracy, robustness, sparsity, and stability over standard deep learning models, as demonstrated by extensive experiments on both tabular and image data sets. The results are further validated by ablation experiments and SHAP value analysis, which reveal the interactions and trade-offs between the different evaluation metrics. To support practitioners applying our framework, we provide a prescriptive approach that offers recommendations for selecting an appropriate training loss function based on their specific objectives. All the code to reproduce the results can be found at https://github.com/kimvc7/HDL.

1.2MEApr 25, 2024Code

Neyman Meets Causal Machine Learning: Experimental Evaluation of Individualized Treatment Rules

Michael Lingzhi Li, Kosuke Imai

A century ago, Neyman showed how to evaluate the efficacy of treatment using a randomized experiment under a minimal set of assumptions. This classical repeated sampling framework serves as a basis of routine experimental analyses conducted by today's scientists across disciplines. In this paper, we demonstrate that Neyman's methodology can also be used to experimentally evaluate the efficacy of individualized treatment rules (ITRs), which are derived by modern causal machine learning algorithms. In particular, we show how to account for additional uncertainty resulting from a training process based on cross-fitting. The primary advantage of Neyman's approach is that it can be applied to any ITR regardless of the properties of machine learning algorithms that are used to derive the ITR. We also show, somewhat surprisingly, that for certain metrics, it is more efficient to conduct this ex-post experimental evaluation of an ITR than to conduct an ex-ante experimental evaluation that randomly assigns some units to the ITR. Our analysis demonstrates that Neyman's repeated sampling framework is as relevant for causal inference today as it has been since its inception.

4.1LGSep 10, 2025

The CRITICAL Records Integrated Standardization Pipeline (CRISP): End-to-End Processing of Large-scale Multi-institutional OMOP CDM Data

Xiaolong Luo, Michael Lingzhi Li

While existing critical care EHR datasets such as MIMIC and eICU have enabled significant advances in clinical AI research, the CRITICAL dataset opens new frontiers by providing extensive scale and diversity -- containing 1.95 billion records from 371,365 patients across four geographically diverse CTSA institutions. CRITICAL's unique strength lies in capturing full-spectrum patient journeys, including pre-ICU, ICU, and post-ICU encounters across both inpatient and outpatient settings. This multi-institutional, longitudinal perspective creates transformative opportunities for developing generalizable predictive models and advancing health equity research. However, the richness of this multi-site resource introduces substantial complexity in data harmonization, with heterogeneous collection practices and diverse vocabulary usage patterns requiring sophisticated preprocessing approaches. We present CRISP to unlock the full potential of this valuable resource. CRISP systematically transforms raw Observational Medical Outcomes Partnership Common Data Model data into ML-ready datasets through: (1) transparent data quality management with comprehensive audit trails, (2) cross-vocabulary mapping of heterogeneous medical terminologies to unified SNOMED-CT standards, with deduplication and unit standardization, (3) modular architecture with parallel optimization enabling complete dataset processing in $<$1 day even on standard computing hardware, and (4) comprehensive baseline model benchmarks spanning multiple clinical prediction tasks to establish reproducible performance standards. By providing processing pipeline, baseline implementations, and detailed transformation documentation, CRISP saves researchers months of preprocessing effort and democratizes access to large-scale multi-institutional critical care data, enabling them to focus on advancing clinical AI.

2.6LGJun 20, 2024

Learning to Cover: Online Learning and Optimization with Irreversible Decisions

Alexandre Jacquillat, Michael Lingzhi Li

We define an online learning and optimization problem with discrete and irreversible decisions contributing toward a coverage target. In each period, a decision-maker selects facilities to open, receives information on the success of each one, and updates a classification model to guide future decisions. The goal is to minimize facility openings under a chance constraint reflecting the coverage target, in an asymptotic regime characterized by a large target number of facilities $m\to\infty$ but a finite horizon $T \in \mathcal{Z}_+$. We prove that, under statistical conditions, the online classifier converges to the Bayes-optimal classifier at a rate of at best $\mathcal{O}(1/\sqrt n)$. Thus, we formulate our online learning and optimization problem, with a generalized learning rate $r>0$ and a residual error $1-p$. We derive an asymptotically optimal algorithm and an asymptotically tight lower bound. The regret grows in $Θ\left(m^{\frac{1-r}{1-r^T}}\right)$ if $p=1$ (perfect learning) or in $Θ\left(\max\left\{m^{\frac{1-r}{1-r^T}},\sqrt{m}\right\}\right)$ otherwise; in particular, the regret rate is sub-linear and converges exponentially fast to its infinite-horizon limit. We extend this result to a more complicated facility location setting in a bipartite facility-customer graph with a target on customer coverage. Throughout, constructive proofs identify a policy featuring limited exploration initially and fast exploitation later on once uncertainty gets mitigated. These results uncover the benefits of limited online learning and optimization through pilot programs prior to full-fledged expansion.

2.6LGMar 11, 2024

Cramming Contextual Bandits for On-policy Statistical Evaluation

Zeyang Jia, Kosuke Imai, Michael Lingzhi Li

We introduce the cram method as a general statistical framework for evaluating the final learned policy from a multi-armed contextual bandit algorithm, using the dataset generated by the same bandit algorithm. The proposed on-policy evaluation methodology differs from most existing methods that focus on off-policy performance evaluation of contextual bandit algorithms. Cramming utilizes an entire bandit sequence through a single pass of data, leading to both statistically and computationally efficient evaluation. We prove that if a bandit algorithm satisfies a certain stability condition, the resulting crammed evaluation estimator is consistent and asymptotically normal under mild regularity conditions. Furthermore, we show that this stability condition holds for commonly used linear contextual bandit algorithms, including epsilon-greedy, Thompson Sampling, and Upper Confidence Bound algorithms. Using both synthetic and publicly available datasets, we compare the empirical performance of cramming with the state-of-the-art methods. The results demonstrate that the proposed cram method reduces the evaluation standard error by approximately 40% relative to off-policy evaluation methods while preserving unbiasedness and valid confidence interval coverage.

26.0LGFeb 25, 2022Code

Integrated multimodal artificial intelligence framework for healthcare applications

Luis R. Soenksen, Yu Ma, Cynthia Zeng et al.

Artificial intelligence (AI) systems hold great promise to improve healthcare over the next decades. Specifically, AI systems leveraging multiple data sources and input modalities are poised to become a viable method to deliver more accurate results and deployable pipelines across a wide range of applications. In this work, we propose and evaluate a unified Holistic AI in Medicine (HAIM) framework to facilitate the generation and testing of AI systems that leverage multimodal inputs. Our approach uses generalizable data pre-processing and machine learning modeling stages that can be readily adapted for research and deployment in healthcare environments. We evaluate our HAIM framework by training and characterizing 14,324 independent models based on HAIM-MIMIC-MM, a multimodal clinical database (N=34,537 samples) containing 7,279 unique hospitalizations and 6,485 patients, spanning all possible input combinations of 4 data modalities (i.e., tabular, time-series, text, and images), 11 unique data sources and 12 predictive tasks. We show that this framework can consistently and robustly produce models that outperform similar single-source approaches across various healthcare demonstrations (by 6-33%), including 10 distinct chest pathology diagnoses, along with length-of-stay and 48-hour mortality predictions. We also quantify the contribution of each modality and data source using Shapley values, which demonstrates the heterogeneity in data modality importance and the necessity of multimodal inputs across different healthcare-relevant tasks. The generalizable properties and flexibility of our Holistic AI in Medicine (HAIM) framework could offer a promising pathway for future multimodal predictive systems in clinical and operational healthcare settings.

2.2OCMar 3, 2021

Stochastic Cutting Planes for Data-Driven Optimization

Dimitris Bertsimas, Michael Lingzhi Li

We introduce a stochastic version of the cutting-plane method for a large class of data-driven Mixed-Integer Nonlinear Optimization (MINLO) problems. We show that under very weak assumptions the stochastic algorithm is able to converge to an $ε$-optimal solution with high probability. Numerical experiments on several problems show that stochastic cutting planes is able to deliver a multiple order-of-magnitude speedup compared to the standard cutting-plane method. We further experimentally explore the lower limits of sampling for stochastic cutting planes and show that for many problems, a sampling size of $O(\sqrt[3]{n})$ appears to be sufficient for high quality solutions.

6.6APJun 30, 2020

From predictions to prescriptions: A data-driven response to COVID-19

Dimitris Bertsimas, Léonard Boussioux, Ryan Cory Wright et al.

The COVID-19 pandemic has created unprecedented challenges worldwide. Strained healthcare providers make difficult decisions on patient triage, treatment and care management on a daily basis. Policy makers have imposed social distancing measures to slow the disease, at a steep economic price. We design analytical tools to support these decisions and combat the pandemic. Specifically, we propose a comprehensive data-driven approach to understand the clinical characteristics of COVID-19, predict its mortality, forecast its evolution, and ultimately alleviate its impact. By leveraging cohort-level clinical data, patient-level hospital data, and census-level epidemiological data, we develop an integrated four-step approach, combining descriptive, predictive and prescriptive analytics. First, we aggregate hundreds of clinical studies into the most comprehensive database on COVID-19 to paint a new macroscopic picture of the disease. Second, we build personalized calculators to predict the risk of infection and mortality as a function of demographics, symptoms, comorbidities, and lab values. Third, we develop a novel epidemiological model to project the pandemic's spread and inform social distancing policies. Fourth, we propose an optimization model to re-allocate ventilators and alleviate shortages. Our results have been used at the clinical level by several hospitals to triage patients, guide care management, plan ICU capacity, and re-distribute ventilators. At the policy level, they are currently supporting safe back-to-work policies at a major institution and equitable vaccine distribution planning at a major pharmaceutical company, and have been integrated into the US Center for Disease Control's pandemic forecast.

5.4LGNov 13, 2019

A Hierarchy of Graph Neural Networks Based on Learnable Local Features

Michael Lingzhi Li, Meng Dong, Jiawei Zhou et al.

Graph neural networks (GNNs) are a powerful tool to learn representations on graphs by iteratively aggregating features from node neighbourhoods. Many variant models have been proposed, but there is limited understanding on both how to compare different architectures and how to construct GNNs systematically. Here, we propose a hierarchy of GNNs based on their aggregation regions. We derive theoretical results about the discriminative power and feature representation capabilities of each class. Then, we show how this framework can be utilized to systematically construct arbitrarily powerful GNNs. As an example, we construct a simple architecture that exceeds the expressiveness of the Weisfeiler-Lehman graph isomorphism test. We empirically validate our theory on both synthetic and real-world benchmarks, and demonstrate our example's theoretical power translates to strong results on node classification, graph classification, and graph regression tasks.

4.8LGOct 21, 2019

Fast Exact Matrix Completion: A Unified Optimization Framework for Matrix Completion

Dimitris Bertsimas, Michael Lingzhi Li

We formulate the problem of matrix completion with and without side information as a non-convex optimization problem. We design fastImpute based on non-convex gradient descent and show it converges to a global minimum that is guaranteed to recover closely the underlying matrix while it scales to matrices of sizes beyond $10^5 \times 10^5$. We report experiments on both synthetic and real-world datasets that show fastImpute is competitive in both the accuracy of the matrix recovered and the time needed across all cases. Furthermore, when a high number of entries are missing, fastImpute is over $75\%$ lower in MAPE and $15$ times faster than current state-of-the-art matrix completion methods in both the case with side information and without.

1.0LGMar 12, 2019

Duration-of-Stay Storage Assignment under Uncertainty

Michael Lingzhi Li, Elliott Wolf, Daniel Wintz

Optimizing storage assignment is a central problem in warehousing. Past literature has shown the superiority of the Duration-of-Stay (DoS) method in assigning pallets, but the methodology requires perfect prior knowledge of DoS for each pallet, which is unknown and uncertain under realistic conditions. The dynamic nature of a warehouse further complicates the validity of synthetic data testing that is often conducted for algorithms. In this paper, in collaboration with a large cold storage company, we release the first publicly available set of warehousing records to facilitate research into this central problem. We introduce a new framework for storage assignment that accounts for uncertainty in warehouses. Then, by utilizing a combination of convolutional and recurrent neural network models, ParallelNet, we show that it is able to predict future shipments well: it achieves up to 29% decrease in MAPE compared to CNN-LSTM on unseen future shipments, and suffers less performance decay over time. The framework is then integrated into a first-of-its-kind Storage Assignment system, which is being piloted in warehouses across the country, with initial results showing up to 19% in labor savings.

7.0MLFeb 8, 2019

Scalable Holistic Linear Regression

Dimitris Bertsimas, Michael Lingzhi Li

We propose a new scalable algorithm for holistic linear regression building on Bertsimas & King (2016). Specifically, we develop new theory to model significance and multicollinearity as lazy constraints rather than checking the conditions iteratively. The resulting algorithm scales with the number of samples $n$ in the 10,000s, compared to the low 100s in the previous framework. Computational results on real and synthetic datasets show it greatly improves from previous algorithms in accuracy, false detection rate, computational time and scalability.

11.1OCDec 17, 2018

Interpretable Matrix Completion: A Discrete Optimization Approach

Dimitris Bertsimas, Michael Lingzhi Li

We consider the problem of matrix completion on an $n \times m$ matrix. We introduce the problem of Interpretable Matrix Completion that aims to provide meaningful insights for the low-rank matrix using side information. We show that the problem can be reformulated as a binary convex optimization problem. We design OptComplete, based on a novel concept of stochastic cutting planes to enable efficient scaling of the algorithm up to matrices of sizes $n=10^6$ and $m=10^6$. We report experiments on both synthetic and real-world datasets that show that OptComplete has favorable scaling behavior and accuracy when compared with state-of-the-art methods for other types of matrix completion, while providing insight on the factors that affect the matrix.