Max Rudolph

LG
h-index71
5papers
33citations
Novelty51%
AI Score30

5 Papers

LGFeb 13, 2025Code
Reevaluating Policy Gradient Methods for Imperfect-Information Games

Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour et al.

In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results of the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP-, DO-, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for four large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 5600 training runs, we find that FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods. Code is available at https://github.com/nathanlct/IIG-RL-Benchmark and https://github.com/gabrfarina/exp-a-spiel .

AIDec 7, 2024
RLZero: Direct Policy Inference from Language Without In-Domain Supervision

Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo et al.

The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. However, in practice, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches either require costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses a pretrained RL agent trained using only unlabeled, offline interactions--without task-specific supervision or labeled trajectories--to get zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the agent imagines a sequence of observations corresponding to the provided language description using video generative models. Next, these imagined observations are projected into the target environment domain. Finally, an agent pretrained in the target environment with unsupervised RL instantly imitates the projected observation sequence through a closed-form solution. To the best of our knowledge, our method, RLZero, is the first approach to show direct language-to-behavior generation abilities on a variety of tasks and environments without any in-domain supervision. We further show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos, such as those available on YouTube, even for complex embodiments like humanoids.

LGMar 25, 2024
Learning Action-based Representations Using Invariance

Max Rudolph, Caleb Chuck, Kevin Black et al.

Robust reinforcement learning agents using high-dimensional observations must be able to identify relevant state features amidst many exogeneous distractors. A representation that captures controllability identifies these state elements by determining what affects agent control. While methods such as inverse dynamics and mutual information capture controllability for a limited number of timesteps, capturing long-horizon elements remains a challenging problem. Myopic controllability can capture the moment right before an agent crashes into a wall, but not the control-relevance of the wall while the agent is still some distance away. To address this we introduce action-bisimulation encoding, a method inspired by the bisimulation invariance pseudometric, that extends single-step controllability with a recursive invariance constraint. By doing this, action-bisimulation learns a multi-step controllability metric that smoothly discounts distant state features that are relevant for control. We demonstrate that action-bisimulation pretraining on reward-free, uniformly random data improves sample efficiency in several environments, including a photorealistic 3D simulation domain, Habitat. Additionally, we provide theoretical analysis and qualitative results demonstrating the information captured by action-bisimulation.

ROMay 6, 2024
Robot Air Hockey: A Manipulation Testbed for Robot Learning with Reinforcement Learning

Caleb Chuck, Carl Qi, Michael J. Munje et al.

Reinforcement Learning is a promising tool for learning complex policies even in fast-moving and object-interactive domains where human teleoperation or hard-coded policies might fail. To effectively reflect this challenging category of tasks, we introduce a dynamic, interactive RL testbed based on robot air hockey. By augmenting air hockey with a large family of tasks ranging from easy tasks like reaching, to challenging ones like pushing a block by hitting it with a puck, as well as goal-based and human-interactive tasks, our testbed allows a varied assessment of RL capabilities. The robot air hockey testbed also supports sim-to-real transfer with three domains: two simulators of increasing fidelity and a real robot system. Using a dataset of demonstration data gathered through two teleoperation systems: a virtualized control environment, and human shadowing, we assess the testbed with behavior cloning, offline RL, and RL from scratch.

ROAug 1, 2021
Desperate Times Call for Desperate Measures: Towards Risk-Adaptive Task Allocation

Max Rudolph, Sonia Chernova, Harish Ravichandar

Multi-robot task allocation (MRTA) problems involve optimizing the allocation of robots to tasks. MRTA problems are known to be challenging when tasks require multiple robots and the team is composed of heterogeneous robots. These challenges are further exacerbated when we need to account for uncertainties encountered in the real-world. In this work, we address coalition formation in heterogeneous multi-robot teams with uncertain capabilities. We specifically focus on tasks that require coalitions to collectively satisfy certain minimum requirements. Existing approaches to uncertainty-aware task allocation either maximize expected pay-off (risk-neutral approaches) or improve worst-case or near-worst-case outcomes (risk-averse approaches). Within the context of our problem, we demonstrate the inherent limitations of unilaterally ignoring or avoiding risk and show that these approaches can in fact reduce the probability of satisfying task requirements. Inspired by models that explain foraging behaviors in animals, we develop a risk-adaptive approach to task allocation. Our approach adaptively switches between risk-averse and risk-seeking behavior in order to maximize the probability of satisfying task requirements. Comprehensive numerical experiments conclusively demonstrate that our risk-adaptive approach outperforms risk-neutral and risk-averse approaches. We also demonstrate the effectiveness of our approach using a simulated multi-robot emergency response scenario.