Ruohan Zhan

LG
h-index15
15papers
656citations
Novelty55%
AI Score46

15 Papers

LGJul 5, 2023
Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization

Sanath Kumar Krishnamurthy, Ruohan Zhan, Susan Athey et al.

In many applications, e.g. in healthcare and e-commerce, the goal of a contextual bandit may be to learn an optimal treatment assignment policy at the end of the experiment. That is, to minimize simple regret. However, this objective remains understudied. We propose a new family of computationally efficient bandit algorithms for the stochastic contextual bandit setting, where a tuning parameter determines the weight placed on cumulative regret minimization (where we establish near-optimal minimax guarantees) versus simple regret minimization (where we establish state-of-the-art guarantees). Our algorithms work with any function class, are robust to model misspecification, and can be used in continuous arm settings. This flexibility comes from constructing and relying on "conformal arm sets" (CASs). CASs provide a set of arms for every context, encompassing the context-specific optimal arm with a certain probability across the context distribution. Our positive results on simple and cumulative regret guarantees are contrasted with a negative result, which shows that no algorithm can achieve instance-dependent simple regret guarantees while simultaneously achieving minimax optimal cumulative regret guarantees.

LGFeb 3, 2023
Two-Stage Constrained Actor-Critic for Short Video Recommendation

Qingpeng Cai, Zhenghai Xue, Chi Zhang et al.

The wide popularity of short videos on social media poses new opportunities and challenges to optimize recommender systems on the video-sharing platforms. Users sequentially interact with the system and provide complex and multi-faceted responses, including watch time and various types of interactions with multiple videos. One the one hand, the platforms aims at optimizing the users' cumulative watch time (main goal) in long term, which can be effectively optimized by Reinforcement Learning. On the other hand, the platforms also needs to satisfy the constraint of accommodating the responses of multiple user interactions (auxiliary goals) such like, follow, share etc. In this paper, we formulate the problem of short video recommendation as a Constrained Markov Decision Process (CMDP). We find that traditional constrained reinforcement learning algorithms can not work well in this setting. We propose a novel two-stage constrained actor-critic method: At stage one, we learn individual policies to optimize each auxiliary signal. At stage two, we learn a policy to (i) optimize the main signal and (ii) stay close to policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive offline evaluations, we demonstrate effectiveness of our method over alternatives in both optimizing the main goal as well as balancing the others. We further show the advantage of our method in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of both watch time and interactions. Our approach has been fully launched in the production system to optimize user experiences on the platform.

IRJun 1, 2022
ResAct: Reinforcing Long-term Engagement in Sequential Recommendation with Residual Actor

Wanqi Xue, Qingpeng Cai, Ruohan Zhan et al.

Long-term engagement is preferred over immediate engagement in sequential recommendation as it directly affects product operational metrics such as daily active users (DAUs) and dwell time. Meanwhile, reinforcement learning (RL) is widely regarded as a promising framework for optimizing long-term engagement in sequential recommendation. However, due to expensive online interactions, it is very difficult for RL algorithms to perform state-action value estimation, exploration and feature extraction when optimizing long-term engagement. In this paper, we propose ResAct which seeks a policy that is close to, but better than, the online-serving policy. In this way, we can collect sufficient data near the learned policy so that state-action values can be properly estimated, and there is no need to perform online exploration. ResAct optimizes the policy by first reconstructing the online behaviors and then improving it via a Residual Actor. To extract long-term information, ResAct utilizes two information-theoretical regularizers to confirm the expressiveness and conciseness of features. We conduct experiments on a benchmark dataset and a large-scale industrial dataset which consists of tens of millions of recommendation requests. Experimental results show that our method significantly outperforms the state-of-the-art baselines in various long-term engagement optimization tasks.

MLFeb 17, 2023
Post Reinforcement Learning Inference

Vasilis Syrgkanis, Ruohan Zhan

We study estimation and inference using data collected by reinforcement learning (RL) algorithms. These algorithms adaptively experiment by interacting with individual units over multiple stages, updating their strategies based on past outcomes. Our goal is to evaluate a counterfactual policy after data collection and estimate structural parameters, such as dynamic treatment effects, that support credit assignment and quantify the impact of early actions on final outcomes. These parameters can often be defined as solutions to moment equations, motivating moment-based estimation methods developed for static data. In RL settings, however, data are often collected adaptively under nonstationary behavior policies. As a result, standard estimators fail to achieve asymptotic normality due to time-varying variance. We propose a weighted generalized method of moments (GMM) approach that uses adaptive weights to stabilize this variance. We characterize weighting schemes that ensure consistency and asymptotic normality of the weighted GMM estimators, enabling valid hypothesis testing and uniform confidence region construction. Key applications include dynamic treatment effect estimation and dynamic off-policy evaluation.

LGMay 26, 2022
Constrained Reinforcement Learning for Short Video Recommendation

Qingpeng Cai, Ruohan Zhan, Chi Zhang et al.

The wide popularity of short videos on social media poses new opportunities and challenges to optimize recommender systems on the video-sharing platforms. Users provide complex and multi-faceted responses towards recommendations, including watch time and various types of interactions with videos. As a result, established recommendation algorithms that concern a single objective are not adequate to meet this new demand of optimizing comprehensive user experiences. In this paper, we formulate the problem of short video recommendation as a constrained Markov Decision Process (MDP), where platforms want to optimize the main goal of user watch time in long term, with the constraint of accommodating the auxiliary responses of user interactions such as sharing/downloading videos. To solve the constrained MDP, we propose a two-stage reinforcement learning approach based on actor-critic framework. At stage one, we learn individual policies to optimize each auxiliary response. At stage two, we learn a policy to (i) optimize the main response and (ii) stay close to policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive simulations, we demonstrate effectiveness of our approach over alternatives in both optimizing the main goal as well as balancing the others. We further show the advantage of our approach in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of watch time and interactions from video views. Our approach has been fully launched in the production system to optimize user experiences on the platform.

65.0LGMay 12
Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds

Yunbei Xu, Yuzhe Yuan, Ruohan Zhan

We study the fundamental and timely problem of learning long sequences in autoregressive modeling and next-token prediction under model misspecification, measured by the joint Kullback--Leibler (KL) divergence. Our goal is to characterize how the sequence horizon \(H\) affects both approximation and estimation errors in this joint-distribution, sequence-level regime. By establishing matching upper and lower bounds, we provide, to our knowledge, the first complete characterization of long-horizon error behavior under the natural joint KL objective, with improved rates and optimality justification relative to existing work. On the approximation side, we show that joint KL admits a horizon-free approximation factor, in sharp contrast to Hellinger-based analyses that exhibit an \(Ω(H)\) dependence for computationally efficient methods; this isolates the choice of divergence as the source of approximation amplification. On the estimation side, we prove a fundamental information-theoretic lower bound of order \(Ω(H)\) that holds for both decomposable policy classes and fully shared policies, matching the \(\widetilde O(H)\) upper bounds achieved by computationally efficient algorithms. Our analysis clarifies the landscape of recent autoregressive learning results by aligning the log-loss training objective, the sequence-level evaluation metric, and the approximation metric {\color{black}through a sharp joint-KL oracle theory}. We further show that these joint-KL guarantees imply policy learning regret bounds at rates matching prior imitation learning literature.

LGDec 18, 2024
Distributionally Robust Policy Learning under Concept Drifts

Jingyuan Wang, Zhimei Ren, Ruohan Zhan et al.

Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, and yet most existing methods for robust policy learning consider the worst-case joint distribution of the covariate and the outcome. The joint-modeling strategy can be unnecessarily conservative when we have more information on the source of distributional shifts. This paper studies a more nuanced problem -- robust policy learning under the concept drift, when only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated with a slower-than-root-$n$ rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class $Π$, and show that the sub-optimality gap of the proposed algorithm is of the order $κ(Π)n^{-1/2}$, where $κ(Π)$ is the entropy integral of $Π$ under the Hamming distance and $n$ is the sample size. A matching lower bound is provided to show the optimality of the rate. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement compared with existing benchmarks.

LGFeb 18, 2025
Fragility-aware Classification for Understanding Risk and Improving Generalization

Chen Yang, Zheng Cui, Daniel Zhuoyu Long et al.

Classification models play a critical role in data-driven decision-making applications such as medical diagnosis, user profiling, recommendation systems, and default detection. Traditional performance metrics, such as accuracy, focus on overall error rates but fail to account for the confidence of incorrect predictions, thereby overlooking the risk of confident misjudgments. This risk is particularly significant in cost-sensitive and safety-critical domains like medical diagnosis and autonomous driving, where overconfident false predictions may cause severe consequences. To address this issue, we introduce the Fragility Index (FI), a novel metric that evaluates classification performance from a risk-averse perspective by explicitly capturing the tail risk of confident misjudgments. To enhance generalizability, we define FI within the robust satisficing (RS) framework, incorporating data uncertainty. We further develop a model training approach that optimizes FI while maintaining tractability for common loss functions. Specifically, we derive exact reformulations for cross-entropy loss, hinge-type loss, and Lipschitz loss, and extend the approach to deep learning models. Through synthetic experiments and real-world medical diagnosis tasks, we demonstrate that FI effectively identifies misjudgment risk and FI-based training improves model robustness and generalizability. Finally, we extend our framework to deep neural network training, further validating its effectiveness in enhancing deep learning models.

EMJun 20, 2024
Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach

Ruohan Zhan, Shichao Han, Yuchen Hu et al.

Recommender systems are essential for content-sharing platforms by curating personalized content. To improve recommender systems, platforms frequently rely on creator-side randomized experiments to evaluate algorithm updates. We show that commonly adopted difference-in-means estimators can lead to severely biased estimates due to recommender interference, where treated and control creators compete for exposure. This bias can result in incorrect business decisions. To address this, we propose a ``recommender choice model'' that explicitly represents the interference pathway. The approach combines a structural choice framework with neural networks to account for rich viewer-content heterogeneity. Building on this foundation, we develop a debiased estimator using the double machine learning (DML) framework to adjust for errors from nuisance component estimation. We show that the estimator is $\sqrt{n}$-consistent and asymptotically normal, and we extend the DML theory to handle correlated data, which arise in our context due to overlapped items. We validate our method with a large-scale field experiment on Weixin short-video platform, using a costly double-sided randomization design to obtain an interference-free ground truth. Our results show that the proposed estimator successfully recovers this ground truth, whereas benchmark estimators exhibit substantial bias, and in some cases, yield reversed signs.

LGJun 7, 2024
Adaptively Learning to Select-Rank in Online Platforms

Jingyuan Wang, Perry Dong, Ying Jin et al.

Ranking algorithms are fundamental to various online platforms across e-commerce sites to content streaming services. Our research addresses the challenge of adaptively ranking items from a candidate pool for heterogeneous users, a key component in personalizing user experience. We develop a user response model that considers diverse user preferences and the varying effects of item positions, aiming to optimize overall user satisfaction with the ranked list. We frame this problem within a contextual bandits framework, with each ranked list as an action. Our approach incorporates an upper confidence bound to adjust predicted user satisfaction scores and selects the ranking action that maximizes these adjusted scores, efficiently solved via maximum weight imperfect matching. We demonstrate that our algorithm achieves a cumulative regret bound of $O(d\sqrt{NKT})$ for ranking $K$ out of $N$ items in a $d$-dimensional context space over $T$ rounds, under the assumption that user responses follow a generalized linear model. This regret alleviates dependence on the ambient action space, whose cardinality grows exponentially with $N$ and $K$ (thus rendering direct application of existing adaptive learning algorithms -- such as UCB or Thompson sampling -- infeasible). Experiments conducted on both simulated and real-world datasets demonstrate our algorithm outperforms the baseline.

MLJun 3, 2021
Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits

Ruohan Zhan, Vitor Hadad, David A. Hirshberg et al.

It has become increasingly common for data to be collected adaptively, for example using contextual bandits. Historical data of this type can be used to evaluate other treatment assignment policies to guide future innovation or experiments. However, policy evaluation is challenging if the target policy differs from the one used to collect data, and popular estimators, including doubly robust (DR) estimators, can be plagued by bias, excessive variance, or both. In particular, when the pattern of treatment assignment in the collected data looks little like the pattern generated by the policy to be evaluated, the importance weights used in DR estimators explode, leading to excessive variance. In this paper, we improve the DR estimator by adaptively weighting observations to control its variance. We show that a t-statistic based on our improved estimator is asymptotically normal under certain conditions, allowing us to form confidence intervals and test hypotheses. Using synthetic data and public benchmarks, we provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.

LGMay 6, 2021
Towards Content Provider Aware Recommender Systems: A Simulation Study on the Interplay between User and Provider Utilities

Ruohan Zhan, Konstantina Christakopoulou, Ya Le et al.

Most existing recommender systems focus primarily on matching users to content which maximizes user satisfaction on the platform. It is increasingly obvious, however, that content providers have a critical influence on user satisfaction through content creation, largely determining the content pool available for recommendation. A natural question thus arises: can we design recommenders taking into account the long-term utility of both users and content providers? By doing so, we hope to sustain more providers and a more diverse content pool for long-term user satisfaction. Understanding the full impact of recommendations on both user and provider groups is challenging. This paper aims to serve as a research investigation of one approach toward building a provider-aware recommender, and evaluating its impact in a simulated setup. To characterize the user-recommender-provider interdependence, we complement user modeling by formalizing provider dynamics as well. The resulting joint dynamical system gives rise to a weakly-coupled partially observable Markov decision process driven by recommender actions and user feedback to providers. We then build a REINFORCE recommender agent, coined EcoAgent, to optimize a joint objective of user utility and the counterfactual utility lift of the provider associated with the recommended content, which we show to be equivalent to maximizing overall user utility and the utilities of all providers on the platform under some mild assumptions. To evaluate our approach, we introduce a simulation environment capturing the key interactions among users, providers, and the recommender. We offer a number of simulated experiments that shed light on both the benefits and the limitations of our approach. These results help understand how and when a provider-aware recommender agent is of benefit in building multi-stakeholder recommender systems.

MLMay 5, 2021
Policy Learning with Adaptively Collected Data

Ruohan Zhan, Zhimei Ren, Susan Athey et al.

Learning optimal policies from historical data enables personalization in a wide variety of applications including healthcare, digital recommendations, and online education. The growing policy learning literature focuses on settings where the data collection rule stays fixed throughout the experiment. However, adaptive data collection is becoming more common in practice, from two primary sources: 1) data collected from adaptive experiments that are designed to improve inferential efficiency; 2) data collected from production systems that progressively evolve an operational policy to improve performance over time (e.g. contextual bandits). Yet adaptivity complicates the optimal policy identification ex post, since samples are dependent, and each treatment may not receive enough observations for each type of individual. In this paper, we make initial research inquiries into addressing the challenges of learning the optimal policy with adaptively collected data. We propose an algorithm based on generalized augmented inverse propensity weighted (AIPW) estimators, which non-uniformly reweight the elements of a standard AIPW estimator to control worst-case estimation variance. We establish a finite-sample regret upper bound for our algorithm and complement it with a regret lower bound that quantifies the fundamental difficulty of policy learning with adaptive data. When equipped with the best weighting scheme, our algorithm achieves minimax rate optimal regret guarantees even with diminishing exploration. Finally, we demonstrate our algorithm's effectiveness using both synthetic data and public benchmark datasets.

MMJan 14, 2020
Distortion Agnostic Deep Watermarking

Xiyang Luo, Ruohan Zhan, Huiwen Chang et al.

Watermarking is the process of embedding information into an image that can survive under distortions, while requiring the encoded image to have little or no perceptual difference from the original image. Recently, deep learning-based methods achieved impressive results in both visual quality and message payload under a wide variety of image distortions. However, these methods all require differentiable models for the image distortions at training time, and may generalize poorly to unknown distortions. This is undesirable since the types of distortions applied to watermarked images are usually unknown and non-differentiable. In this paper, we propose a new framework for distortion-agnostic watermarking, where the image distortion is not explicitly modeled during training. Instead, the robustness of our system comes from two sources: adversarial training and channel coding. Compared to training on a fixed set of distortions and noise levels, our method achieves comparable or better results on distortions available during training, and better performance on unknown distortions.

MLNov 7, 2019
Confidence Intervals for Policy Evaluation in Adaptive Experiments

Vitor Hadad, David A. Hirshberg, Ruohan Zhan et al.

Adaptive experiment designs can dramatically improve statistical efficiency in randomized trials, but they also complicate statistical inference. For example, it is now well known that the sample mean is biased in adaptive trials. Inferential challenges are exacerbated when our parameter of interest differs from the parameter the trial was designed to target, such as when we are interested in estimating the value of a sub-optimal treatment after running a trial to determine the optimal treatment using a stochastic bandit design. In this context, typical estimators that use inverse propensity weighting to eliminate sampling bias can be problematic: their distributions become skewed and heavy-tailed as the propensity scores decay to zero. In this paper, we present a class of estimators that overcome these issues. Our approach is to adaptively reweight the terms of an augmented inverse propensity weighting estimator to control the contribution of each term to the estimator's variance. This adaptive weighting scheme prevents estimates from becoming heavy-tailed, ensuring asymptotically correct coverage. It also reduces variance, allowing us to test hypotheses with greater power - especially hypotheses that were not targeted by the experimental design. We validate the accuracy of the resulting estimates and their confidence intervals in numerical experiments and show our methods compare favorably to existing alternatives in terms of RMSE and coverage.