Shi Dong

LG
h-index75
14papers
319citations
Novelty54%
AI Score34

14 Papers

LGNov 29, 2022
Posterior Sampling for Continuing Environments

Wanqiao Xu, Shi Dong, Benjamin Van Roy · stanford

We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $γ$-discounted return in that model. At each time, with probability $1-γ$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(τS \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $τ$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.

LGDec 24, 2022
Inclusive Artificial Intelligence

Dilip Arumugam, Shi Dong, Benjamin Van Roy · stanford

Prevailing methods for assessing and comparing generative AIs incentivize responses that serve a hypothetical representative individual. Evaluating models in these terms presumes homogeneous preferences across the population and engenders selection of agglomerative AIs, which fail to represent the diverse range of interests across individuals. We propose an alternative evaluation method that instead prioritizes inclusive AIs, which provably retain the requisite knowledge not only for subsequent response customization to particular segments of the population but also for utility-maximizing decisions.

CLJun 4, 2023
Fine-Tuning Language Models with Advantage-Induced Policy Alignment

Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri et al.

Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is of the most widely used methods. Despite its popularity, however, PPO may suffer from mode collapse, instability, and poor sample efficiency. We show that these issues can be alleviated by a novel algorithm that we refer to as Advantage-Induced Policy Alignment (APA), which leverages a squared error loss function based on the estimated advantages. We demonstrate empirically that APA consistently outperforms PPO in language tasks by a large margin, when a separate reward model is employed as the evaluator. In addition, compared with PPO, APA offers a more stable form of control over the deviation from the model's initial policy, ensuring that the model improves its performance without collapsing to deterministic output. In addition to empirical results, we also provide a theoretical justification supporting the design of our loss function.

AIJul 7, 2022
UIILD: A Unified Interpretable Intelligent Learning Diagnosis Framework for Intelligent Tutoring Systems

Zhifeng Wang, Wenxing Yan, Chunyan Zeng et al.

Intelligent learning diagnosis is a critical engine of intelligent tutoring systems, which aims to estimate learners' current knowledge mastery status and predict their future learning performance. The significant challenge with traditional learning diagnosis methods is the inability to balance diagnostic accuracy and interpretability. Although the existing psychometric-based learning diagnosis methods provide some domain interpretation through cognitive parameters, they have insufficient modeling capability with a shallow structure for large-scale learning data. While the deep learning-based learning diagnosis methods have improved the accuracy of learning performance prediction, their inherent black-box properties lead to a lack of interpretability, making their results untrustworthy for educational applications. To settle the above problem, the proposed unified interpretable intelligent learning diagnosis (UIILD) framework, which benefits from the powerful representation learning ability of deep learning and the interpretability of psychometrics, achieves a better performance of learning prediction and provides interpretability from three aspects: cognitive parameters, learner-resource response network, and weights of self-attention mechanism. Within the proposed framework, this paper presents a two-channel learning diagnosis mechanism LDM-ID as well as a three-channel learning diagnosis mechanism LDM-HMI. Experiments on two real-world datasets and a simulation dataset show that our method has higher accuracy in predicting learners' performances compared with the state-of-the-art models, and can provide valuable educational interpretability for applications such as precise learning resource recommendation and personalized learning tutoring in intelligent tutoring systems.

LGNov 4, 2024Code
Leveraging Label Semantics and Meta-Label Refinement for Multi-Label Question Classification

Shi Dong, Xiaobei Niu, Rui Zhong et al.

Accurate annotation of educational resources is crucial for effective personalized learning and resource recommendation in online education. However, fine-grained knowledge labels often overlap or share similarities, making it difficult for existing multi-label classification methods to differentiate them. The label distribution imbalance due to sparsity of human annotations further intensifies these challenges. To address these issues, this paper introduces RR2QC, a novel Retrieval Reranking method to multi-label Question Classification by leveraging label semantics and meta-label refinement. First, RR2QC improves the pre-training strategy by utilizing semantic relationships within and across label groups. Second, it introduces a class center learning task to align questions with label semantics during downstream training. Finally, this method decomposes labels into meta-labels and uses a meta-label classifier to rerank the retrieved label sequences. In doing so, RR2QC enhances the understanding and prediction capability of long-tail labels by learning from meta-labels that frequently appear in other labels. Additionally, a mathematical LLM is used to generate solutions for questions, extracting latent information to further refine the model's insights. Experimental results show that RR2QC outperforms existing methods in Precision@K and F1 scores across multiple educational datasets, demonstrating its effectiveness for online education applications. The code and datasets are available at https://github.com/78Erii/RR2QC.

AIFeb 6, 2025
Agency Is Frame-Dependent

David Abel, André Barreto, Michael Bowling et al. · deepmind

Agency is a system's capacity to steer outcomes toward a goal, and is a central topic of study across biology, philosophy, cognitive science, and artificial intelligence. Determining if a system exhibits agency is a notoriously difficult question: Dennett (1989), for instance, highlights the puzzle of determining which principles can decide whether a rock, a thermostat, or a robot each possess agency. We here address this puzzle from the viewpoint of reinforcement learning by arguing that agency is fundamentally frame-dependent: Any measurement of a system's agency must be made relative to a reference frame. We support this claim by presenting a philosophical argument that each of the essential properties of agency proposed by Barandiaran et al. (2009) and Moreno (2018) are themselves frame-dependent. We conclude that any basic science of agency requires frame-dependence, and discuss the implications of this claim for reinforcement learning.

AIMay 15, 2025
Plasticity as the Mirror of Empowerment

David Abel, Michael Bowling, André Barreto et al. · deepmind

Agents are minimally entities that are influenced by their past observations and act to influence future observations. This latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. This former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Under this definition, we find that plasticity is well thought of as the mirror of empowerment: The two concepts are defined using the same measure, with only the direction of influence reversed. Our main result establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency

LGJan 23, 2025
Concurrent Learning with Aggregated States via Randomized Least Squares Value Iteration

Yan Chen, Qinxun Bai, Yiteng Zhang et al.

Designing learning agents that explore efficiently in a complex environment has been widely recognized as a fundamental challenge in reinforcement learning. While a number of works have demonstrated the effectiveness of techniques based on randomized value functions on a single agent, it remains unclear, from a theoretical point of view, whether injecting randomization can help a society of agents {\it concurently} explore an environment. The theoretical results %that we established in this work tender an affirmative answer to this question. We adapt the concurrent learning framework to \textit{randomized least-squares value iteration} (RLSVI) with \textit{aggregated state representation}. We demonstrate polynomial worst-case regret bounds in both finite- and infinite-horizon environments. In both setups the per-agent regret decreases at an optimal rate of $Θ\left(\frac{1}{\sqrt{N}}\right)$, highlighting the advantage of concurent learning. Our algorithm exhibits significantly lower space complexity compared to \cite{russo2019worst} and \cite{agrawal2021improved}. We reduce the space complexity by a factor of $K$ while incurring only a $\sqrt{K}$ increase in the worst-case regret bound, compared to \citep{agrawal2021improved,russo2019worst}. Additionally, we conduct numerical experiments to demonstrate our theoretical findings.

CLMay 19, 2023
Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

Wanqiao Xu, Shi Dong, Dilip Arumugam et al.

A centerpiece of the ever-popular reinforcement learning from human feedback (RLHF) approach to fine-tuning autoregressive language models is the explicit training of a reward model to emulate human feedback, distinct from the language model itself. This reward model is then coupled with policy-gradient methods to dramatically improve the alignment between language model outputs and desired responses. In this work, we adopt a novel perspective wherein a pre-trained language model is itself simultaneously a policy, reward function, and transition function. An immediate consequence of this is that reward learning and language model fine-tuning can be performed jointly and directly, without requiring any further downstream policy optimization. While this perspective does indeed break the traditional agent-environment interface, we nevertheless maintain that there can be enormous statistical benefits afforded by bringing to bear traditional algorithmic concepts from reinforcement learning. Our experiments demonstrate one concrete instance of this through efficient exploration based on the representation and resolution of epistemic uncertainty. In order to illustrate these ideas in a transparent manner, we restrict attention to a simple didactic data generating process and leave for future work extension to systems of practical scale.

LGFeb 10, 2021
Simple Agent, Complex Environment: Efficient Reinforcement Learning with Agent States

Shi Dong, Benjamin Van Roy, Zhengyuan Zhou

We design a simple reinforcement learning (RL) agent that implements an optimistic version of $Q$-learning and establish through regret analysis that this agent can operate with some level of competence in any environment. While we leverage concepts from the literature on provably efficient RL, we consider a general agent-environment interface and provide a novel agent design and analysis. This level of generality positions our results to inform the design of future agents for operation in complex real environments. We establish that, as time progresses, our agent performs competitively relative to policies that require longer times to evaluate. The time it takes to approach asymptotic performance is polynomial in the complexity of the agent's state representation and the time required to evaluate the best policy that the agent can represent. Notably, there is no dependence on the complexity of the environment. The ultimate per-period performance loss of the agent is bounded by a constant multiple of a measure of distortion introduced by the agent's state representation. This work is the first to establish that an algorithm approaches this asymptotic condition within a tractable time frame.

MLDec 13, 2019
Provably Efficient Reinforcement Learning with Aggregated States

Shi Dong, Benjamin Van Roy, Zhengyuan Zhou

We establish that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + εHK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $ε$ is the largest difference between any pair of optimal state-action values associated with a common aggregate state. Notably, this regret bound does not depend on the number of states or actions and indicates that asymptotic per-period regret is no greater than $ε$, independent of horizon. To our knowledge, this is the first such result that applies to reinforcement learning with nontrivial value function approximation without any restrictions on transition probabilities.

LGNov 18, 2019
Comments on the Du-Kakade-Wang-Yang Lower Bounds

Benjamin Van Roy, Shi Dong

Du, Kakade, Wang, and Yang recently established intriguing lower bounds on sample complexity, which suggest that reinforcement learning with a misspecified representation is intractable. Another line of work, which centers around a statistic called the eluder dimension, establishes tractability of problems similar to those considered in the Du-Kakade-Wang-Yang paper. We compare these results and reconcile interpretations.

MLMay 12, 2019
On the Performance of Thompson Sampling on Logistic Bandits

Shi Dong, Tengyu Ma, Benjamin Van Roy

We study the logistic bandit, in which rewards are binary with success probability $\exp(βa^\top θ) / (1 + \exp(βa^\top θ))$ and actions $a$ and coefficients $θ$ are within the $d$-dimensional unit ball. While prior regret bounds for algorithms that address the logistic bandit exhibit exponential dependence on the slope parameter $β$, we establish a regret bound for Thompson sampling that is independent of $β$. Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $\tilde{O}(d\sqrt{T})$. We also establish a $\tilde{O}(\sqrt{dηT}/λ)$ bound that applies more broadly, where $λ$ is the worst-case optimal log-odds and $η$ is the "fragility dimension," a new statistic we define to capture the degree to which an optimal action for one model fails to satisfice for others. We demonstrate that the fragility dimension plays an essential role by showing that, for any $ε> 0$, no algorithm can achieve $\mathrm{poly}(d, 1/λ)\cdot T^{1-ε}$ regret.

MLMay 30, 2018
An Information-Theoretic Analysis for Thompson Sampling with Many Actions

Shi Dong, Benjamin Van Roy

Information-theoretic Bayesian regret bounds of Russo and Van Roy capture the dependence of regret on prior uncertainty. However, this dependence is through entropy, which can become arbitrarily large as the number of actions increases. We establish new bounds that depend instead on a notion of rate-distortion. Among other things, this allows us to recover through information-theoretic arguments a near-optimal bound for the linear bandit. We also offer a bound for the logistic bandit that dramatically improves on the best previously available, though this bound depends on an information-theoretic statistic that we have only been able to quantify via computation.