David Zhao

8papers

100citations

Novelty53%

AI Score27

Ranked #162,019 of 205,806 authors (top 79%)#2,487 in ML (top 71%)

8 Papers

MLMay 29, 2022

Towards Instance-Wise Calibration: Local Amortized Diagnostics and Reshaping of Conditional Densities (LADaR)

Biprateep Dey, David Zhao, Brett H. Andrews et al.

Key science questions, such as galaxy distance estimation and weather forecasting, often require knowing the full predictive distribution of a target variable $y$ given complex inputs $\mathbf{x}$. Despite recent advances in machine learning and physics-based models, it remains challenging to assess whether an initial model is calibrated for all $\mathbf{x}$, and when needed, to reshape the densities of $y$ toward "instance-wise" calibration. This paper introduces the LADaR (Local Amortized Diagnostics and Reshaping of Conditional Densities) framework and proposes a new computationally efficient algorithm ($\texttt{Cal-PIT}$) that produces interpretable local diagnostics and provides a mechanism for adjusting conditional density estimates (CDEs). $\texttt{Cal-PIT}$ learns a single interpretable local probability--probability map from calibration data that identifies where and how the initial model is miscalibrated across feature space, which can be used to morph CDEs such that they are well-calibrated. We illustrate the LADaR framework on synthetic examples, including probabilistic forecasting from image sequences, akin to predicting storm wind speed from satellite imagery. Our main science application involves estimating the probability density functions of galaxy distances given photometric data, where $\texttt{Cal-PIT}$ achieves better instance-wise calibration than all 11 other literature methods in a benchmark data challenge, demonstrating its utility for next-generation cosmological analyses.

LGAug 30, 2022

Dynamic Global Sensitivity for Differentially Private Contextual Bandits

Huazheng Wang, David Zhao, Hongning Wang

Bandit algorithms have become a reference solution for interactive recommendation. However, as such algorithms directly interact with users for improved recommendations, serious privacy concerns have been raised regarding its practical use. In this work, we propose a differentially private linear contextual bandit algorithm, via a tree-based mechanism to add Laplace or Gaussian noise to model parameters. Our key insight is that as the model converges during online update, the global sensitivity of its parameters shrinks over time (thus named dynamic global sensitivity). Compared with existing solutions, our dynamic global sensitivity analysis allows us to inject less noise to obtain $(ε, δ)$-differential privacy with added regret caused by noise injection in $\tilde O(\log{T}\sqrt{T}/ε)$. We provide a rigorous theoretical analysis over the amount of noise added via dynamic global sensitivity and the corresponding upper regret bound of our proposed algorithm. Experimental results on both synthetic and real-world datasets confirmed the algorithm's advantage against existing solutions.

MLJul 8, 2021Code

Likelihood-Free Frequentist Inference: Bridging Classical Statistics and Machine Learning for Reliable Simulator-Based Inference

Niccolò Dalmasso, Luca Masserano, David Zhao et al.

Many areas of science rely on simulators that implicitly encode intractable likelihood functions of complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, especially outside asymptotic and low-dimensional regimes. At the same time, popular LFI methods - such as Approximate Bayesian Computation or more recent machine learning techniques - do not necessarily lead to valid scientific inference because they do not guarantee confidence sets with nominal coverage in general settings. In addition, LFI currently lacks practical diagnostic tools to check the actual coverage of computed confidence sets across the entire parameter space. In this work, we propose a modular inference framework that bridges classical statistics and modern machine learning to provide (i) a practical approach for constructing confidence sets with near finite-sample validity at any value of the unknown parameters, and (ii) interpretable diagnostics for estimating empirical coverage across the entire parameter space. We refer to this framework as likelihood-free frequentist inference (LF2I). Any method that defines a test statistic can leverage LF2I to create valid confidence sets and diagnostics without costly Monte Carlo or bootstrap samples at fixed parameter settings. We study two likelihood-based test statistics (ACORE and BFF) and demonstrate their performance on high-dimensional complex data. Code is available at https://github.com/lee-group-cmu/lf2i.

IMOct 28, 2021

Re-calibrating Photometric Redshift Probability Distributions Using Feature-space Regression

Biprateep Dey, Jeffrey A. Newman, Brett H. Andrews et al.

Many astrophysical analyses depend on estimates of redshifts (a proxy for distance) determined from photometric (i.e., imaging) data alone. Inaccurate estimates of photometric redshift uncertainties can result in large systematic errors. However, probability distribution outputs from many photometric redshift methods do not follow the frequentist definition of a Probability Density Function (PDF) for redshift -- i.e., the fraction of times the true redshift falls between two limits $z_{1}$ and $z_{2}$ should be equal to the integral of the PDF between these limits. Previous works have used the global distribution of Probability Integral Transform (PIT) values to re-calibrate PDFs, but offsetting inaccuracies in different regions of feature space can conspire to limit the efficacy of the method. We leverage a recently developed regression technique that characterizes the local PIT distribution at any location in feature space to perform a local re-calibration of photometric redshift PDFs. Though we focus on an example from astrophysics, our method can produce PDFs which are calibrated at all locations in feature space for any use case.

MLJul 7, 2021

MD-split+: Practical Local Conformal Inference in High Dimensions

Benjamin LeRoy, David Zhao

Quantifying uncertainty in model predictions is a common goal for practitioners seeking more than just point predictions. One tool for uncertainty quantification that requires minimal assumptions is conformal inference, which can help create probabilistically valid prediction regions for black box models. Classical conformal prediction only provides marginal validity, whereas in many situations locally valid prediction regions are desirable. Deciding how best to partition the feature space X when applying localized conformal prediction is still an open question. We present MD-split+, a practical local conformal approach that creates X partitions based on localized model performance of conditional density estimation models. Our method handles complex real-world data settings where such models may be misspecified, and scales to high-dimensional inputs. We discuss how our local partitions philosophically align with expected behavior from an unattainable conditional conformal inference approach. We also empirically compare our method against other local conformal approaches.

NIDec 4, 2019

Reinforcement learning for bandwidth estimation and congestion control in real-time communications

Joyce Fang, Martin Ellis, Bin Li et al.

Bandwidth estimation and congestion control for real-time communications (i.e., audio and video conferencing) remains a difficult problem, despite many years of research. Achieving high quality of experience (QoE) for end users requires continual updates due to changing network architectures and technologies. In this paper, we apply reinforcement learning for the first time to the problem of real-time communications (RTC), where we seek to optimize user-perceived quality. We present initial proof-of-concept results, where we learn an agent to control sending rate in an RTC system, evaluating using both network simulation and real Internet video calls. We discuss the challenges we observed, particularly in designing realistic reward functions that reflect QoE, and in bridging the gap between the training environment and real-world networks.

TRNov 26, 2019

Cryptocurrency Price Prediction and Trading Strategies Using Support Vector Machines

David Zhao, Alessandro Rinaldo, Christopher Brookins

Few assets in financial history have been as notoriously volatile as cryptocurrencies. While the long term outlook for this asset class remains unclear, we are successful in making short term price predictions for several major crypto assets. Using historical data from July 2015 to November 2019, we develop a large number of technical indicators to capture patterns in the cryptocurrency market. We then test various classification methods to forecast short-term future price movements based on these indicators. On both PPV and NPV metrics, our classifiers do well in identifying up and down market moves over the next 1 hour. Beyond evaluating classification accuracy, we also develop a strategy for translating 1-hour-ahead class predictions into trading decisions, along with a backtester that simulates trading in a realistic environment. We find that support vector machines yield the most profitable trading strategies, which outperform the market on average for Bitcoin, Ethereum and Litecoin over the past 22 months, since January 2018.

CROct 30, 2018

SAFE-PDF: Robust Detection of JavaScript PDF Malware Using Abstract Interpretation

Alexander Jordan, François Gauthier, Behnaz Hassanshahi et al.

The popularity of the PDF format and the rich JavaScript environment that PDF viewers offer make PDF documents an attractive attack vector for malware developers. PDF documents present a serious threat to the security of organizations because most users are unsuspecting of them and thus likely to open documents from untrusted sources. We propose to identify malicious PDFs by using conservative abstract interpretation to statically reason about the behavior of the embedded JavaScript code. Currently, state-of-the-art tools either: (1) statically identify PDF malware based on structural similarity to known malicious samples; or (2) dynamically execute the code to detect malicious behavior. These two approaches are subject to evasion attacks that mimic the structure of benign documents or do not exhibit their malicious behavior when being analyzed dynamically. In contrast, abstract interpretation is oblivious to both types of evasions. A comparison with two state-of-the-art PDF malware detection tools shows that our conservative abstract interpretation approach achieves similar accuracy, while being more resilient to evasion attacks.