LG Jul 5, 2022
Approximating Discontinuous Nash Equilibrial Values of Two-Player General-Sum Differential Games
Lei Zhang, Mukesh Ghimire, Wenlong Zhang et al.
Finding Nash equilibrial policies for two-player differential games requires solving Hamilton-Jacobi-Isaacs (HJI) PDEs. Self-supervised learning has been used to approximate solutions of such PDEs while circumventing the curse of dimensionality. However, this method fails to learn discontinuous PDE solutions due to its sampling nature, leading to poor safety performance of the resulting controllers in robotics applications when player rewards are discontinuous. This paper investigates two potential solutions to this problem: a hybrid method that leverages both supervised Nash equilibria and the HJI PDE, and a value-hardening method where a sequence of HJIs is solved with a gradually hardening reward. We compare these solutions using the resulting generalization and safety performance in two vehicle interaction simulation studies with 5D and 9D state spaces, respectively. Results show that with informative supervision (e.g., collision and near-collision demonstrations) and the low cost of self-supervised learning, the hybrid method achieves better safety performance than the supervised, self-supervised, and value-hardening approaches on an equal computational budget. Value hardening fails to generalize in the higher-dimensional case without informative supervision. Lastly, we show that the neural activation function needs to be continuously differentiable for learning PDEs and that its choice can be case dependent.
CL Jan 8
PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards
Mukesh Ghimire, Aosong Feng, Liwen You et al.
Current techniques for post-training Large Language Models (LLMs) rely either on costly human supervision or on external verifiers to boost performance on tasks such as mathematical reasoning and code generation. However, as LLMs improve their problem-solving, any further improvement will potentially require high-quality solutions to difficult problems that are not available to humans. As a result, learning from unlabeled data is becoming increasingly attractive in the research community. Existing methods extract a learning signal from a model's consistency, either by majority voting or by converting the model's internal confidence into a reward. Although internal consistency metrics such as entropy or self-certainty require no human intervention, as we show in this work, they are unreliable signals for large-scale and long-term training. To address this unreliability, we propose PRISM, a unified training framework that uses a Process Reward Model (PRM) to guide learning alongside the model's internal confidence in the absence of ground-truth labels. We show that effectively combining the PRM with self-certainty can lead to both stable training and better test-time performance, and can also keep the model's internal confidence in check. Code is available at https://github.com/ghimiremukesh/PRISM.
RO Nov 28, 2023
Value Approximation for Two-Player General-Sum Differential Games with State Constraints
Lei Zhang, Mukesh Ghimire, Wenlong Zhang et al.
Solving Hamilton-Jacobi-Isaacs (HJI) PDEs numerically enables equilibrial feedback control in two-player differential games, yet faces the curse of dimensionality (CoD). While physics-informed neural networks (PINNs) have shown promise in alleviating CoD in solving PDEs, vanilla PINNs fall short in learning discontinuous solutions due to their sampling nature, leading to poor safety performance of the resulting policies when values are discontinuous due to state or temporal logic constraints. In this study, we explore three potential solutions to this challenge: (1) a hybrid learning method that is guided by both supervisory equilibria and the HJI PDE, (2) a value-hardening method where a sequence of HJIs is solved with an increasing Lipschitz constant on the constraint violation penalty, and (3) the epigraphical technique that lifts the value to a higher-dimensional state space where it becomes continuous. Evaluations through 5D and 9D vehicle and 13D drone simulations reveal that the hybrid method outperforms the others in terms of generalization and safety performance by taking advantage of both the supervisory equilibrium values and costates, and the low cost of PINN loss gradients.
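The value-hardening idea in item (2) can be illustrated with a toy penalty schedule. This is a minimal sketch, not the authors' implementation: it assumes a logistic penalty on a signed distance to the constraint boundary, and the names `soft_penalty`, `lipschitz_constant`, and `hardening_schedule`, as well as the geometric schedule, are hypothetical choices made for illustration only.

```python
import math


def soft_penalty(distance, sharpness, scale=1.0):
    """Smooth stand-in for a constraint-violation penalty (assumed form).

    Returns about `scale` when distance << 0 (violation) and about 0 when
    distance >> 0 (safe). Larger `sharpness` makes it closer to a step.
    """
    z = sharpness * distance
    # Numerically stable evaluation of scale * sigmoid(-z).
    if z >= 0:
        e = math.exp(-z)
        return scale * e / (1.0 + e)
    return scale / (1.0 + math.exp(z))


def lipschitz_constant(sharpness, scale=1.0):
    """Largest slope of the logistic penalty, attained at distance = 0."""
    return scale * sharpness / 4.0


def hardening_schedule(k_start, k_end, stages):
    """Geometric sequence of sharpness values from k_start to k_end."""
    if stages < 2:
        raise ValueError("need at least two stages")
    ratio = (k_end / k_start) ** (1.0 / (stages - 1))
    return [k_start * ratio ** i for i in range(stages)]


if __name__ == "__main__":
    for k in hardening_schedule(1.0, 100.0, 5):
        # In a value-hardening loop, one HJI problem would be solved per
        # stage, warm-started from the previous stage (not shown here).
        print(f"sharpness={k:8.2f}  Lipschitz={lipschitz_constant(k):7.2f}  "
              f"penalty(-0.1)={soft_penalty(-0.1, k):.3f}")
```

Each stage keeps the penalty differentiable while its Lipschitz constant grows, so the final stage approximates the discontinuous penalty.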
GT May 4
Fast Strategy Solving for the Informed Player in Two-Player Zero-Sum Linear-Quadratic Differential Games with One-Sided Information
Mukesh Ghimire, Zhe Xu, Yi Ren
We study finite-horizon two-player zero-sum differential games with one-sided payoff information ($G$), where the informed player (P1) knows the game payoff, while P2 only has a public belief over a finite set of possible payoffs. In this case, P1's Nash equilibrium (NE) behavioral strategy may control the release of the type information or even resort to manipulating P2's belief. Previous studies revealed an atomic structure of the NE of $G$ with general nonlinear dynamics and payoffs, leading to tractable NE approximation. Implementing such approximation schemes for real-time sub-game solving, however, has not been achieved, yet is desired for applications where sim-to-real gaps exist and robust control is required. This paper improves the computational efficiency of sub-game solving for P1 during $G$ with linear dynamics and quadratic losses. Specifically, we show that P1's NE computation can be formulated as a bi-level optimization problem where the outer level optimizes the "signaling" strategy, i.e., when and how to reveal information through control, and the inner level is a game-tree LQR that solves for the optimal closed-loop control. This bi-level problem is solved via an adjoint-enabled backpropagation scheme: a "backward" LQR pass is followed by a "forward" gradient descent pass for improving the signaling. We apply the proposed algorithm to approximate NEs for variants of a homing problem with an 8D state space, 2D action spaces, and a discrete time horizon of $K=10$. The algorithm achieves $\approx$10 Hz sub-game solving, enabling robust game-theoretic planning under information asymmetry and random disturbances.
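For readers unfamiliar with the "backward" LQR pass mentioned above, the following is a minimal sketch of the standard finite-horizon Riccati recursion for a single-player, scalar, discrete-time system. It is not the game-tree LQR of the paper; the function names and the scalar setting are assumptions chosen to keep the example short.

```python
def lqr_backward_pass(a, b, q, r, q_final, horizon):
    """Scalar finite-horizon LQR for x[k+1] = a*x[k] + b*u[k].

    Stage cost is q*x^2 + r*u^2 and terminal cost is q_final*x^2.
    Returns (gains, values) where u[k] = -gains[k] * x[k] is optimal and
    values[k] * x^2 is the optimal cost-to-go from step k.
    """
    p = q_final
    gains = [0.0] * horizon
    values = [0.0] * (horizon + 1)
    values[horizon] = p
    # Sweep backward in time, updating the Riccati coefficient p.
    for k in range(horizon - 1, -1, -1):
        gain = (b * p * a) / (r + b * p * b)
        p = q + a * p * a - a * p * b * gain
        gains[k] = gain
        values[k] = p
    return gains, values


def rollout_cost(x0, a, b, q, r, q_final, gains):
    """Simulate the closed loop forward and accumulate the total cost."""
    x, cost = x0, 0.0
    for gain in gains:
        u = -gain * x
        cost += q * x * x + r * u * u
        x = a * x + b * u
    return cost + q_final * x * x


if __name__ == "__main__":
    # Horizon of 10 steps, mirroring K=10 in the abstract; the remaining
    # numbers are arbitrary illustrative values.
    gains, values = lqr_backward_pass(1.0, 0.5, 1.0, 0.1, 1.0, 10)
    print("first gain:", gains[0])
    print("predicted cost:", values[0] * 2.0 ** 2)
    print("simulated cost:", rollout_cost(2.0, 1.0, 0.5, 1.0, 0.1, 1.0, gains))
```

The predicted and simulated costs should agree up to floating-point error, which is a convenient sanity check for any backward-pass implementation.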
LG Jan 3, 2024
Pontryagin Neural Operator for Solving Parametric General-Sum Differential Games
Lei Zhang, Mukesh Ghimire, Zhe Xu et al.
The values of two-player general-sum differential games are viscosity solutions to Hamilton-Jacobi-Isaacs (HJI) equations. Value and policy approximations for such games suffer from the curse of dimensionality (CoD). Alleviating CoD through physics-informed neural networks (PINNs) encounters convergence issues when differentiable values with large Lipschitz constants are present due to state constraints. On top of these challenges, it is often necessary to learn generalizable values and policies across a parametric space of games, e.g., for game parameter inference when information is incomplete. To address these challenges, we propose a Pontryagin-mode neural operator that outperforms the current state-of-the-art hybrid PINN model on safety performance across games with parametric state constraints. Our key contribution is the introduction of a costate loss defined on the discrepancy between forward and backward costate rollouts, which are computationally cheap. We show that the costate dynamics, which can reflect state constraint violation, effectively enable the learning of differentiable values with large Lipschitz constants, without requiring manually supervised data as suggested by the hybrid PINN model. More importantly, we show that the close relationship between costates and policies makes the former critical in learning feedback control policies with generalizable safety performance.
GT Mar 5, 2024
State-Constrained Zero-Sum Differential Games with One-Sided Information
Mukesh Ghimire, Lei Zhang, Zhe Xu et al.
We study zero-sum differential games with state constraints and one-sided information, where the informed player (Player 1) has a categorical payoff type unknown to the uninformed player (Player 2). The goal of Player 1 is to minimize his payoff without violating the constraints, while that of Player 2 is to violate the state constraints if possible, or to maximize the payoff otherwise. One example of the game is a man-to-man matchup in football. Without state constraints, Cardaliaguet (2007) showed that the value of such a game exists and is convex in the common belief of the players. Our theoretical contribution is an extension of this result to games with state constraints and the derivation of the primal and dual subdynamic principles necessary for computing behavioral strategies. Unlike existing works, which are concerned with the scalability of no-regret learning in games with discrete dynamics, our study reveals the underlying structure of strategies for belief manipulation resulting from information asymmetry and state constraints. This structure will be necessary for scalable learning on games with continuous actions and long time windows. We use a simplified football game to demonstrate the utility of this work, where we reveal player positions and belief states in which the attacker should (or should not) play specific random deceptive moves to take advantage of information asymmetry, and compute how the defender should respond.
RO Dec 24, 2021
Lane Change Decision-Making through Deep Reinforcement Learning
Mukesh Ghimire, Malobika Roy Choudhury, Guna Sekhar Sai Harsha Lagudu
Due to the complexity and volatility of the traffic environment, decision-making in autonomous driving is a significantly hard problem. In this project, we use a Deep Q-Network (DQN), along with rule-based constraints, to make lane-changing decisions. Safe and efficient lane-change behavior may be obtained by combining high-level lateral decision-making with low-level rule-based trajectory monitoring. The agent is expected to perform appropriate lane-change maneuvers in a real-world-like Udacity simulator after being trained for a total of 100 episodes. The results show that the rule-based DQN performs better than the plain DQN method. The rule-based DQN achieves a safety rate of 0.8 and an average speed of 47 MPH.
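One common way to combine a learned Q-function with rule-based constraints is to mask out actions that a safety rule forbids and act greedily over the rest. The sketch below illustrates that pattern only; the abstract does not specify the project's actual rules, so the action names, the gap-based rule, and the `min_gap` threshold are all hypothetical.

```python
ACTIONS = ("keep_lane", "change_left", "change_right")


def allowed_actions(gap_left, gap_right, min_gap):
    """Return the actions permitted by a simple gap rule (assumed rule).

    `gap_left` / `gap_right` are free distances in the adjacent lanes,
    or None when no such lane exists. Keeping the lane is always allowed.
    """
    allowed = ["keep_lane"]
    if gap_left is not None and gap_left >= min_gap:
        allowed.append("change_left")
    if gap_right is not None and gap_right >= min_gap:
        allowed.append("change_right")
    return allowed


def select_action(q_values, gap_left, gap_right, min_gap=10.0):
    """Greedy action over Q-values, restricted to rule-approved actions."""
    candidates = allowed_actions(gap_left, gap_right, min_gap)
    return max(candidates, key=lambda action: q_values[action])


if __name__ == "__main__":
    # Q-values would come from the trained network; these are made up.
    q = {"keep_lane": 0.1, "change_left": 0.9, "change_right": 0.5}
    print(select_action(q, gap_left=5.0, gap_right=20.0))
```

In the usage example, the left lane change has the highest Q-value but is blocked by the gap rule, so the agent falls back to the best permitted action.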