Aaron Sonabend-W

4 papers

60 citations

Novelty 59%

AI Score 43

Ranked #80,107 of 201,326 authors (top 40%), #17,993 in LG (top 42%)

4 Papers

ML · Jun 11, 2022

Federated Offline Reinforcement Learning

Doudou Zhou, Yufeng Zhang, Aaron Sonabend-W et al.

Evidence-based or data-driven dynamic treatment regimes are essential for personalized medicine, which can benefit from offline reinforcement learning (RL). Although massive healthcare data are available across medical institutions, the institutions are prohibited from sharing them due to privacy constraints. In addition, the data are heterogeneous across sites. Federated offline RL algorithms are therefore a necessary and promising way to address both problems. In this paper, we propose a multi-site Markov decision process model that allows for both homogeneous and heterogeneous effects across sites. The proposed model makes the analysis of site-level features possible. We design the first federated policy optimization algorithm for offline RL with a sample complexity guarantee. The proposed algorithm is communication-efficient: it requires only a single round of communication, in which sites exchange summary statistics. We give a theoretical guarantee for the proposed algorithm, showing that the suboptimality of the learned policies is comparable to the rate achievable if the data were not distributed. Extensive simulations demonstrate the effectiveness of the proposed algorithm. The method is applied to a multi-site sepsis dataset to illustrate its use in clinical settings.
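The single-round exchange of summary statistics described in this abstract can be illustrated with a much simpler problem. The sketch below is not the paper's policy optimization algorithm; it assumes a plain linear regression in which each site shares only its Gram matrix and moment vector, and a server combines them once. The names `site_summary` and `aggregate` are hypothetical.

```python
import numpy as np

def site_summary(X, y):
    """Summary statistics one site shares: Gram matrix and moment vector.

    Raw rows of X and y never leave the site (illustrative assumption).
    """
    return X.T @ X, X.T @ y

def aggregate(summaries, ridge=0.0):
    """Combine per-site summaries in a single round and solve for weights."""
    d = summaries[0][0].shape[0]
    gram = sum(s[0] for s in summaries) + ridge * np.eye(d)
    moment = sum(s[1] for s in summaries)
    return np.linalg.solve(gram, moment)

# Simulated data for three sites of different sizes (purely synthetic).
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
sites = []
for n in (50, 80, 120):
    X = rng.normal(size=(n, 3))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    sites.append((X, y))

# Federated estimate: one message per site, one solve on the server.
w_fed = aggregate([site_summary(X, y) for X, y in sites])

# Reference estimate computed as if all data were pooled in one place.
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
w_pooled = np.linalg.lstsq(X_all, y_all, rcond=None)[0]
```

In this toy setting the federated and pooled estimates coincide up to numerical error, which mirrors, in a very reduced form, the claim that the learned policy performs as if the data were not distributed.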

74.5 · AO-PH · May 17

Quantification of atmospheric carbon dioxide from the Geostationary Operational Environmental Satellite (GOES East)

Aaron Sonabend-W, Sean Campbell, John Platt et al.

There is a growing urgency to track greenhouse gases with the resolution, precision and accuracy needed to support independent verification of $CO_2$ fluxes at local to global scales. The current generation of space-based sensors, however, only provides sparse observations in space and time. This challenge has fueled interest in using data from existing missions, originally developed for other applications, to infer global greenhouse gas variability. The Advanced Baseline Imager (ABI) onboard the Geostationary Operational Environmental Satellite (GOES-East), operational since 2017, provides full coverage of much of the western hemisphere at 10-minute intervals from geostationary orbit at 16 wavelengths at an approximately 2 $km^2$ spatial resolution. Here, we leverage this high spatial coverage and temporal revisit to develop a single-pixel, physics-guided neural network to estimate dry-air column $CO_2$ mole fraction ($XCO_2$). The model employs a time series of GOES-East's 16 spectral bands, ECMWF ERA5 lower tropospheric meteorology, MODIS surface reflectance, solar and satellite viewing geometry, and day of year. Training used collocated GOES-East and OCO-2/OCO-3 observations. We also present case studies illustrating the use of the model to observe $XCO_2$ enhancements over urban areas and drawdown over agricultural regions. Overall, while the precision of GOES-East derived $XCO_2$ can never rival that of dedicated instruments, the unprecedented combination of contiguous geographic coverage, 10-minute temporal frequency, and multi-year record offers the potential to observe aspects of atmospheric $CO_2$ variability currently unseen from space.

LG · Dec 9, 2020

Semi-Supervised Off Policy Reinforcement Learning

Aaron Sonabend-W, Nilanjana Laha, Ashwin N. Ananthakrishnan et al.

Reinforcement learning (RL) has shown great success in estimating sequential treatment strategies which take into account patient heterogeneity. However, health-outcome information, which is used as the reward for reinforcement learning methods, is often not well coded but rather embedded in clinical notes. Extracting precise outcome information is a resource-intensive task, so most of the available well-annotated cohorts are small. To address this issue, we propose a semi-supervised learning (SSL) approach that efficiently leverages a small labeled dataset with the true outcome observed and a large unlabeled dataset with outcome surrogates. In particular, we propose a semi-supervised, efficient approach to Q-learning and doubly robust off-policy value estimation. Generalizing SSL to sequential treatment regimes brings interesting challenges: 1) The feature distribution for Q-learning is unknown, as it includes previous outcomes. 2) The surrogate variables we leverage in the modified SSL framework are predictive of the outcome but not informative about the optimal policy or value function. We provide theoretical results for our Q-function and value function estimators to understand to what degree efficiency can be gained from SSL. Our method is at least as efficient as the supervised approach, and moreover safe, as it is robust to mis-specification of the imputation models.

LG · Jun 23, 2020

Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation

Aaron Sonabend-W, Junwei Lu, Leo A. Celi et al.

Offline Reinforcement Learning (RL) is a promising approach for learning optimal policies in environments where direct exploration is expensive or infeasible. However, the adoption of such policies in practice is often challenging, as they are hard to interpret within the application context and lack measures of uncertainty for the learned policy value and its decisions. To overcome these issues, we propose an Expert-Supervised RL (ESRL) framework which uses uncertainty quantification for offline policy learning. In particular, we make three contributions: 1) the method can learn safe and optimal policies through hypothesis testing, 2) ESRL allows for different levels of risk-averse implementations tailored to the application context, and finally, 3) we propose a way to interpret ESRL's policy at every state through posterior distributions, and use this framework to compute off-policy value function posteriors. We provide theoretical guarantees for our estimators and regret bounds consistent with Posterior Sampling for RL (PSRL). Sample efficiency of ESRL is independent of the chosen risk aversion threshold and quality of the behavior policy.
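One way to picture the hypothesis-testing idea in this abstract is a rule that departs from the expert's (behavior) action only when posterior evidence for an alternative is strong enough. The sketch below is an illustrative assumption, not the ESRL algorithm itself: the function `choose_action`, the form of the test, and the threshold `alpha` are all hypothetical, and the posterior draws of the action values are taken as given.

```python
import numpy as np

def choose_action(q_samples, behavior_action, alpha=0.05):
    """Pick an action at one state from posterior draws of its action values.

    q_samples: array of shape (n_samples, n_actions), each row one posterior
        draw of Q(s, a) for every action a (assumed to be supplied).
    behavior_action: index of the action the expert policy would take.
    alpha: risk-aversion threshold; smaller values defer to the expert more.
    """
    q_samples = np.asarray(q_samples, dtype=float)
    # Candidate alternative: the action with the highest posterior mean value.
    greedy = int(np.argmax(q_samples.mean(axis=0)))
    if greedy == behavior_action:
        return behavior_action
    # Posterior probability that the candidate beats the expert's action.
    p_better = np.mean(q_samples[:, greedy] > q_samples[:, behavior_action])
    # Deviate from the expert only when the evidence clears the threshold.
    return greedy if p_better >= 1.0 - alpha else behavior_action

# Action 1 is better in every draw, so the rule deviates from the expert.
strong = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])
picked_strong = choose_action(strong, behavior_action=0)

# Action 1 is better in only two of three draws, so the rule keeps the expert.
weak = np.array([[0.0, 1.0], [0.0, 5.0], [2.0, 1.0]])
picked_weak = choose_action(weak, behavior_action=0)
```

Raising `alpha` makes the rule less risk averse, which loosely corresponds to the abstract's point that the level of risk aversion can be tailored to the application.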