Zhixuan Fang

5.8LGAug 30, 2022Code

Effective Multi-User Delay-Constrained Scheduling with Deep Recurrent Reinforcement Learning

Pihe Hu, Ling Pan, Yu Chen et al.

Multi-user delay constrained scheduling is important in many real-world applications including wireless communication, live streaming, and cloud computing. Yet, it poses a critical challenge since the scheduler needs to make real-time decisions to guarantee the delay and resource constraints simultaneously without prior information of system dynamics, which can be time-varying and hard to estimate. Moreover, many practical scenarios suffer from partial observability issues, e.g., due to sensing noise or hidden correlation. To tackle these challenges, we propose a deep reinforcement learning (DRL) algorithm, named Recurrent Softmax Delayed Deep Double Deterministic Policy Gradient ($\mathtt{RSD4}$), which is a data-driven method based on a Partially Observed Markov Decision Process (POMDP) formulation. $\mathtt{RSD4}$ guarantees resource and delay constraints by Lagrangian dual and delay-sensitive queues, respectively. It also efficiently tackles partial observability with a memory mechanism enabled by the recurrent neural network (RNN) and introduces user-level decomposition and node-level merging to ensure scalability. Extensive experiments on simulated/real-world datasets demonstrate that $\mathtt{RSD4}$ is robust to system dynamics and partially observable environments, and achieves superior performances over existing DRL and non-DRL-based methods.

7.4CRMay 4

Proof-of-Learning with Incentive Security

Zishuo Zhao, Zhixuan Fang, Xuechao Wang et al.

Most concurrent blockchain systems rely heavily on the Proof-of-Work (PoW) or Proof-of-Stake (PoS) mechanisms for decentralized consensus and security assurance. However, the substantial energy expenditure stemming from computationally intensive yet meaningless tasks has raised considerable concerns surrounding traditional PoW approaches, The PoS mechanism, while free of energy consumption, is subject to security and economic issues. Addressing these issues, the paradigm of Proof-of-Useful-Work (PoUW) seeks to employ challenges of practical significance as PoW, thereby imbuing energy consumption with tangible value. While previous efforts in Proof of Learning (PoL) explored the utilization of deep learning model training SGD tasks as PoUW challenges, recent research has revealed its vulnerabilities to adversarial attacks and the theoretical hardness in crafting a byzantine-secure PoL mechanism. In this paper, we introduce the concept of incentive-security that incentivizes rational provers to behave honestly for their best interest, bypassing the existing hardness to design a PoL mechanism with computational efficiency, a provable incentive-security guarantee and controllable difficulty. Particularly, our work is secure against two attacks, and also improves the computational overhead from $Θ(1)$ to $O(\frac{\log E}{E})$. Furthermore, while most recent research assumes trusted problem providers and verifiers, our design also guarantees frontend incentive-security even when problem providers are untrusted, and verifier incentive-security that bypasses the Verifier's Dilemma. By incorporating ML training into blockchain consensus mechanisms with provable guarantees, our research not only proposes an eco-friendly solution to blockchain systems, but also provides a proposal for a completely decentralized computing power market in the new AI age.

7.7DCJun 3

Clownfish: Scaling DAG-based BFT Consensus via Sparse Edges

Feifan Wang, Jingfan Yu, Zixi Cai et al.

Directed Acyclic Graph (DAG) based BFT protocols have demonstrated the capability to achieve significantly high throughput in practice. Recent advancements focused on minimizing the good-case latency of these protocols, approaching the theoretical lower bound. However, the high communication complexity inherent in existing DAG-based protocols limits their scalability. This primarily arises because each vertex in the DAG must include a linear number of edges (references) to vertices from previous rounds. We present Clownfish, a partially synchronous DAG-based BFT protocol designed to address the scalability bottleneck. Clownfish achieves lower communication complexity by selectively reducing the number of edges in DAG vertices. When using a communication-optimal consistent broadcast, Clownfish attains quadratic total communication complexity per round, outperforming prior DAG-based protocols. Clownfish also reduces the additional latency in failure cases by optimizing the round advancement rule. Additionally, Clownfish supports multiple leaders per round to reduce average latency while maintaining its lower communication complexity. Our experimental evaluation demonstrates that Clownfish provides significantly better scalability than existing DAG-based protocols.

6.8GTApr 17

The Power of Information for Intermediate States in Contract Design

Yirui Zhang, Zhixuan Fang

In the conventional principal-agent problem, a principal delegates a task to an agent and formulates a contract to incentivize the agent's actions on behalf of the principal. However, this framework overlooks the information that is possibly available during the delegation process in some scenarios. To address this limitation, we propose a novel model that incorporates multiple intermediate states to capture such information revealed during the delegation. Furthermore, to evaluate the impact of the information embedded in these intermediate states, we introduce two distinct contracts: the pay-halfway contract, which provides payments based not only on final outcomes but also on intermediate states, and the terminate-halfway contract, which allows the principal to terminate the delegation process upon encountering undesirable intermediate states. This leads to the question of whether and how these contract types can leverage intermediate-state information? In particular, we ask: Can these contract types outperform standard contracts, and if so, when and to what extent? We answer the first question affirmatively and provide several important insights regarding the second, shedding light on the circumstances in which intermediate-state-aware contracts yield substantial advantages.

4.2CRApr 13, 2024

Proof-of-Learning with Incentive Security

Zishuo Zhao, Zhixuan Fang, Xuechao Wang et al.

Most concurrent blockchain systems rely heavily on the Proof-of-Work (PoW) or Proof-of-Stake (PoS) mechanisms for decentralized consensus and security assurance. However, the substantial energy expenditure stemming from computationally intensive yet meaningless tasks has raised considerable concerns surrounding traditional PoW approaches, The PoS mechanism, while free of energy consumption, is subject to security and economic issues. Addressing these issues, the paradigm of Proof-of-Useful-Work (PoUW) seeks to employ challenges of practical significance as PoW, thereby imbuing energy consumption with tangible value. While previous efforts in Proof of Learning (PoL) explored the utilization of deep learning model training SGD tasks as PoUW challenges, recent research has revealed its vulnerabilities to adversarial attacks and the theoretical hardness in crafting a byzantine-secure PoL mechanism. In this paper, we introduce the concept of incentive-security that incentivizes rational provers to behave honestly for their best interest, bypassing the existing hardness to design a PoL mechanism with computational efficiency, a provable incentive-security guarantee and controllable difficulty. Particularly, our work is secure against two attacks, and also improves the computational overhead from $Θ(1)$ to $O(\frac{\log E}{E})$. Furthermore, while most recent research assumes trusted problem providers and verifiers, our design also guarantees frontend incentive-security even when problem providers are untrusted, and verifier incentive-security that bypasses the Verifier's Dilemma. By incorporating ML training into blockchain consensus mechanisms with provable guarantees, our research not only proposes an eco-friendly solution to blockchain systems, but also provides a proposal for a completely decentralized computing power market in the new AI age.

6.6GTMar 7, 2024

RL-CFR: Improving Action Abstraction for Imperfect Information Extensive-Form Games with Reinforcement Learning

Boning Li, Zhixuan Fang, Longbo Huang

Effective action abstraction is crucial in tackling challenges associated with large action spaces in Imperfect Information Extensive-Form Games (IIEFGs). However, due to the vast state space and computational complexity in IIEFGs, existing methods often rely on fixed abstractions, resulting in sub-optimal performance. In response, we introduce RL-CFR, a novel reinforcement learning (RL) approach for dynamic action abstraction. RL-CFR builds upon our innovative Markov Decision Process (MDP) formulation, with states corresponding to public information and actions represented as feature vectors indicating specific action abstractions. The reward is defined as the expected payoff difference between the selected and default action abstractions. RL-CFR constructs a game tree with RL-guided action abstractions and utilizes counterfactual regret minimization (CFR) for strategy derivation. Impressively, it can be trained from scratch, achieving higher expected payoff without increased CFR solving time. In experiments on Heads-up No-limit Texas Hold'em, RL-CFR outperforms ReBeL's replication and Slumbot, demonstrating significant win-rate margins of $64\pm 11$ and $84\pm 17$ mbb/hand, respectively.

4.1LGFeb 19, 2025

Efficient and Optimal Policy Gradient Algorithm for Corrupted Multi-armed Bandits

Jiayuan Liu, Siwei Wang, Zhixuan Fang

In this paper, we consider the stochastic multi-armed bandits problem with adversarial corruptions, where the random rewards of the arms are partially modified by an adversary to fool the algorithm. We apply the policy gradient algorithm SAMBA to this setting, and show that it is computationally efficient, and achieves a state-of-the-art $O(K\log T/Δ) + O(C/Δ)$ regret upper bound, where $K$ is the number of arms, $C$ is the unknown corruption level, $Δ$ is the minimum expected reward gap between the best arm and other ones, and $T$ is the time horizon. Compared with the best existing efficient algorithm (e.g., CBARBAR), whose regret upper bound is $O(K\log^2 T/Δ) + O(C)$, we show that SAMBA reduces one $\log T$ factor in the regret bound, while maintaining the corruption-dependent term to be linear with $C$. This is indeed asymptotically optimal. We also conduct simulations to demonstrate the effectiveness of SAMBA, and the results show that SAMBA outperforms existing baselines.

7.3CRJan 21, 2024

Tempo: Confidentiality Preservation in Cloud-Based Neural Network Training

Rongwu Xu, Zhixuan Fang

Cloud deep learning platforms provide cost-effective deep neural network (DNN) training for customers who lack computation resources. However, cloud systems are often untrustworthy and vulnerable to attackers, leading to growing concerns about model privacy. Recently, researchers have sought to protect data privacy in deep learning by leveraging CPU trusted execution environments (TEEs), which minimize the use of cryptography, but existing works failed to simultaneously utilize the computational resources of GPUs to assist in training and prevent model leakage. This paper presents Tempo, the first cloud-based deep learning system that cooperates with TEE and distributed GPUs for efficient DNN training with model confidentiality preserved. To tackle the challenge of preserving privacy while offloading linear algebraic operations from TEE to GPUs for efficient batch computation, we introduce a customized permutation-based obfuscation algorithm to blind both inputs and model parameters. An optimization mechanism that reduces encryption operations is proposed for faster weight updates during backpropagation to speed up training. We implement Tempo and evaluate it with both training and inference for two prevalent DNNs. Empirical results indicate that Tempo outperforms baselines and offers sufficient privacy protection.

6.6CLMay 11, 2023Code

Cost-efficient Crowdsourcing for Span-based Sequence Labeling: Worker Selection and Data Augmentation

Yujie Wang, Chao Huang, Liner Yang et al.

This paper introduces a novel crowdsourcing worker selection algorithm, enhancing annotation quality and reducing costs. Unlike previous studies targeting simpler tasks, this study contends with the complexities of label interdependencies in sequence labeling. The proposed algorithm utilizes a Combinatorial Multi-Armed Bandit (CMAB) approach for worker selection, and a cost-effective human feedback mechanism. The challenge of dealing with imbalanced and small-scale datasets, which hinders offline simulation of worker selection, is tackled using an innovative data augmentation method termed shifting, expanding, and shrinking (SES). Rigorous testing on CoNLL 2003 NER and Chinese OEI datasets showcased the algorithm's efficiency, with an increase in F1 score up to 100.04% of the expert-only baseline, alongside cost savings up to 65.97%. The paper also encompasses a dataset-independent test emulating annotation evaluation through a Bernoulli distribution, which still led to an impressive 97.56% F1 score of the expert baseline and 59.88% cost savings. Furthermore, our approach can be seamlessly integrated into Reinforcement Learning from Human Feedback (RLHF) systems, offering a cost-effective solution for obtaining human feedback.

13.6OCNov 15, 2021

Simultaneously Achieving Sublinear Regret and Constraint Violations for Online Convex Optimization with Time-varying Constraints

Qingsong Liu, Wenfei Wu, Longbo Huang et al.

In this paper, we develop a novel virtual-queue-based online algorithm for online convex optimization (OCO) problems with long-term and time-varying constraints and conduct a performance analysis with respect to the dynamic regret and constraint violations. We design a new update rule of dual variables and a new way of incorporating time-varying constraint functions into the dual variables. To the best of our knowledge, our algorithm is the first parameter-free algorithm to simultaneously achieve sublinear dynamic regret and constraint violations. Our proposed algorithm also outperforms the state-of-the-art results in many aspects, e.g., our algorithm does not require the Slater condition. Meanwhile, for a group of practical and widely-studied constrained OCO problems in which the variation of consecutive constraints is smooth enough across time, our algorithm achieves $O(1)$ constraint violations. Furthermore, we extend our algorithm and analysis to the case when the time horizon $T$ is unknown. Finally, numerical experiments are conducted to validate the theoretical guarantees of our algorithm, and some applications of our proposed framework will be outlined.

3.1LGFeb 24, 2021

Continuous Mean-Covariance Bandits

Yihan Du, Siwei Wang, Zhixuan Fang et al.

Existing risk-aware multi-armed bandit models typically focus on risk measures of individual options such as variance. As a result, they cannot be directly applied to important real-world online decision making problems with correlated options. In this paper, we propose a novel Continuous Mean-Covariance Bandit (CMCB) model to explicitly take into account option correlation. Specifically, in CMCB, there is a learner who sequentially chooses weight vectors on given options and observes random feedback according to the decisions. The agent's objective is to achieve the best trade-off between reward and risk, measured with option covariance. To capture different reward observation scenarios in practice, we consider three feedback settings, i.e., full-information, semi-bandit and full-bandit feedback. We propose novel algorithms with optimal regrets (within logarithmic factors), and provide matching lower bounds to validate their optimalities. The experimental results also demonstrate the superiority of our algorithms. To the best of our knowledge, this is the first work that considers option correlation in risk-aware bandits and explicitly quantifies how arbitrary covariance structures impact the learning performance. The novel analytical techniques we developed for exploiting the estimated covariance to build concentration and bounding the risk of selected actions based on sampling strategy properties can likely find applications in other bandit analysis and be of independent interests.

13.2AIFeb 13, 2018

A Deep Reinforcement Learning Framework for Rebalancing Dockless Bike Sharing Systems

Ling Pan, Qingpeng Cai, Zhixuan Fang et al.

Bike sharing provides an environment-friendly way for traveling and is booming all over the world. Yet, due to the high similarity of user travel patterns, the bike imbalance problem constantly occurs, especially for dockless bike sharing systems, causing significant impact on service quality and company revenue. Thus, it has become a critical task for bike sharing systems to resolve such imbalance efficiently. In this paper, we propose a novel deep reinforcement learning framework for incentivizing users to rebalance such systems. We model the problem as a Markov decision process and take both spatial and temporal features into consideration. We develop a novel deep reinforcement learning algorithm called Hierarchical Reinforcement Pricing (HRP), which builds upon the Deep Deterministic Policy Gradient algorithm. Different from existing methods that often ignore spatial information and rely heavily on accurate prediction, HRP captures both spatial and temporal dependencies using a divide-and-conquer structure with an embedded localized module. We conduct extensive experiments to evaluate HRP, based on a dataset from Mobike, a major Chinese dockless bike sharing company. Results show that HRP performs close to the 24-timeslot look-ahead optimization, and outperforms state-of-the-art methods in both service level and bike distribution. It also transfers well when applied to unseen areas.

Zhixuan Fang

12 Papers