Duncan McElfresh

CY
h-index16
6papers
367citations
Novelty43%
AI Score46

6 Papers

IRJun 23, 2022Code
On the Generalizability and Predictability of Recommender Systems

Duncan McElfresh, Sujay Khandagale, Jonathan Valverde et al.

While other areas of machine learning have seen more and more automation, designing a high-performing recommender system still requires a high level of human effort. Furthermore, recent work has shown that modern recommender system algorithms do not always improve over well-tuned baselines. A natural follow-up question is, "how do we choose the right algorithm for a new dataset and performance metric?" In this work, we start by giving the first large-scale study of recommender system approaches by comparing 18 algorithms and 100 sets of hyperparameters across 85 datasets and 315 metrics. We find that the best algorithms and hyperparameters are highly dependent on the dataset and performance metric, however, there are also strong correlations between the performance of each algorithm and various meta-features of the datasets. Motivated by these findings, we create RecZilla, a meta-learning approach to recommender systems that uses a model to predict the best algorithm and hyperparameters for new, unseen datasets. By using far more meta-training data than prior work, RecZilla is able to substantially reduce the level of human involvement when faced with a new recommender system application. We not only release our code and pretrained RecZilla models, but also all of our raw experimental results, so that practitioners can train a RecZilla model for their desired performance metric: https://github.com/naszilla/reczilla.

CYApr 30
Adoption and Use of LLMs at an Academic Medical Center

Nigam H. Shah, Nerissa Ambers, Abby Pandya et al.

While large language models (LLMs) can support clinical documentation needs, standalone tools struggle with "workflow friction" from manual data entry. We developed ChatEHR, a system that enables the use of LLMs with the entire patient timeline spanning several years. ChatEHR enables automations - which are static combinations of prompts and data that perform a fixed task - and interactive use in the electronic health record (EHR) via a user interface (UI). The resulting ability to sift through patient medical records for diverse use-cases such as pre-visit chart review, screening for transfer eligibility, monitoring for surgical site infections, and chart abstraction, redefines LLM use as an institutional capability. This system, accessible after user-training, enables continuous monitoring and evaluation of LLM use. In 1.5 years, we built 7 automations and 1075 users have trained to become routine users of the UI, engaging in 23,000 sessions in the first 3 months of launch. For automations, being model-agnostic and accessing multiple types of data was essential for matching specific clinical or administrative tasks with the most appropriate LLM. Benchmark-based evaluations proved insufficient for monitoring and evaluation of the UI, requiring new methods to monitor performance. Generation of summaries was the most frequent task in the UI, with an estimated 0.73 hallucinations and 1.60 inaccuracies per generation. The resulting mix of cost savings, time savings, and revenue growth required a value assessment framework to prioritize work as well as quantify the impact of using LLMs. Initial estimates are $6M savings in the first year of use, without quantifying the benefit of the better care offered. Such a "build-from-within" strategy provides an opportunity for health systems to maintain agency via a vendor-agnostic, internally governed LLM platform.

CYFeb 27, 2024Code
Standing on FURM ground -- A framework for evaluating Fair, Useful, and Reliable AI Models in healthcare systems

Alison Callahan, Duncan McElfresh, Juan M. Banda et al.

The impact of using artificial intelligence (AI) to guide patient care or operational processes is an interplay of the AI model's output, the decision-making protocol based on that output, and the capacity of the stakeholders involved to take the necessary subsequent action. Estimating the effects of this interplay before deployment, and studying it in real time afterwards, are essential to bridge the chasm between AI model development and achievable benefit. To accomplish this, the Data Science team at Stanford Health Care has developed a Testing and Evaluation (T&E) mechanism to identify fair, useful and reliable AI models (FURM) by conducting an ethical review to identify potential value mismatches, simulations to estimate usefulness, financial projections to assess sustainability, as well as analyses to determine IT feasibility, design a deployment strategy, and recommend a prospective monitoring and evaluation plan. We report on FURM assessments done to evaluate six AI guided solutions for potential adoption, spanning clinical and operational settings, each with the potential to impact from several dozen to tens of thousands of patients each year. We describe the assessment process, summarize the six assessments, and share our framework to enable others to conduct similar assessments. Of the six solutions we assessed, two have moved into a planning and implementation phase. Our novel contributions - usefulness estimates by simulation, financial projections to quantify sustainability, and a process to do ethical assessments - as well as their underlying methods and open source tools, are available for other healthcare systems to conduct actionable evaluations of candidate AI solutions.

LGMay 4, 2023Code
When Do Neural Nets Outperform Boosted Trees on Tabular Data?

Duncan McElfresh, Sujay Khandagale, Jonathan Valverde et al.

Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and question the importance of this debate. To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs. A remarkable exception is the recently-proposed prior-data fitted network, TabPFN: although it is effectively limited to training sets of size 3000, we find that it outperforms all other algorithms on average, even when randomly sampling 3000 training datapoints. Next, we analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well. For example, we find that GBDTs are much better than NNs at handling skewed or heavy-tailed feature distributions and other forms of dataset irregularities. Our insights act as a guide for practitioners to determine which techniques may work best on their dataset. Finally, with the goal of accelerating tabular data research, we release the TabZilla Benchmark Suite: a collection of the 36 'hardest' of the datasets we study. Our benchmark suite, codebase, and all raw results are available at https://github.com/naszilla/tabzilla.

OCMar 4, 2020
Robust Active Preference Elicitation

Phebe Vayanos, Yingxiao Ye, Duncan McElfresh et al.

We study the problem of eliciting the preferences of a decision-maker through a moderate number of pairwise comparison queries to make them a high quality recommendation for a specific problem. We are motivated by applications in high stakes domains, such as when choosing a policy for allocating scarce resources to satisfy basic needs (e.g., kidneys for transplantation or housing for those experiencing homelessness) where a consequential recommendation needs to be made from the (partially) elicited preferences. We model uncertainty in the preferences as being set based and} investigate two settings: a) an offline elicitation setting, where all queries are made at once, and b) an online elicitation setting, where queries are selected sequentially over time in an adaptive fashion. We propose robust optimization formulations of these problems which integrate the preference elicitation and recommendation phases with aim to either maximize worst-case utility or minimize worst-case regret, and study their complexity. For the offline case, where active preference elicitation takes the form of a two and half stage robust optimization problem with decision-dependent information discovery, we provide an equivalent reformulation in the form of a mixed-binary linear program which we solve via column-and-constraint generation. For the online setting, where active preference learning takes the form of a multi-stage robust optimization problem with decision-dependent information discovery, we propose a conservative solution approach. Numerical studies on synthetic data demonstrate that our methods outperform state-of-the art approaches from the literature in terms of worst-case rank, regret, and utility. We showcase how our methodology can be used to assist a homeless services agency in choosing a policy for allocating scarce housing resources of different types to people experiencing homelessness.

CYJan 13, 2020
Artificial Artificial Intelligence: Measuring Influence of AI 'Assessments' on Moral Decision-Making

Lok Chan, Kenzie Doyle, Duncan McElfresh et al.

Given AI's growing role in modeling and improving decision-making, how and when to present users with feedback is an urgent topic to address. We empirically examined the effect of feedback from false AI on moral decision-making about donor kidney allocation. We found some evidence that judgments about whether a patient should receive a kidney can be influenced by feedback about participants' own decision-making perceived to be given by AI, even if the feedback is entirely random. We also discovered different effects between assessments presented as being from human experts and assessments presented as being from AI.