Aaron Walsman

CV
h-index96
14papers
615citations
Novelty43%
AI Score43

14 Papers

CVJul 27, 2022Code
Break and Make: Interactive Structural Understanding Using LEGO Bricks

Aaron Walsman, Muru Zhang, Klemen Kotar et al.

Visual understanding of geometric structures with complex spatial relationships is a fundamental component of human intelligence. As children, we learn how to reason about structure not only from observation, but also by interacting with the world around us -- by taking things apart and putting them back together again. The ability to reason about structure and compositionality allows us to not only build things, but also understand and reverse-engineer complex systems. In order to advance research in interactive reasoning for part-based geometric understanding, we propose a challenging new assembly problem using LEGO bricks that we call Break and Make. In this problem an agent is given a LEGO model and attempts to understand its structure by interactively inspecting and disassembling it. After this inspection period, the agent must then prove its understanding by rebuilding the model from scratch using low-level action primitives. In order to facilitate research on this problem we have built LTRON, a fully interactive 3D simulator that allows learning agents to assemble, disassemble and manipulate LEGO models. We pair this simulator with a new dataset of fan-made LEGO creations that have been uploaded to the internet in order to provide complex scenes containing over a thousand unique brick shapes. We take a first step towards solving this problem using sequence-to-sequence models that provide guidance for how to make progress on this challenging problem. Our simulator and data are available at github.com/aaronwalsman/ltron. Additional training code and PyTorch examples are available at github.com/aaronwalsman/ltron-torch-eccv22.

CVApr 5, 2023
ENTL: Embodied Navigation Trajectory Learner

Klemen Kotar, Aaron Walsman, Roozbeh Mottaghi

We propose Embodied Navigation Trajectory Learner (ENTL), a method for extracting long sequence representations for embodied navigation. Our approach unifies world modeling, localization and imitation learning into a single sequence prediction task. We train our model using vector-quantized predictions of future states conditioned on current states and actions. ENTL's generic architecture enables sharing of the spatio-temporal sequence encoder for multiple challenging embodied tasks. We achieve competitive performance on navigation tasks using significantly less data than strong baselines while performing auxiliary tasks such as localization and future frame prediction (a proxy for world modeling). A key property of our approach is that the model is pre-trained without any explicit reward signal, which makes the resulting model generalizable to multiple tasks and environments.

AIMay 28, 2025Code
Decomposing Elements of Problem Solving: What "Math" Does RL Teach?

Tian Qin, Core Francisco Park, Mujin Kwun et al.

Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill-improving execution robustness on problems the model already knows how to solve-a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.

MAOct 21, 2025Code
The Emergence of Complex Behavior in Large-Scale Ecological Environments

Joseph Bejjani, Chase Van Amburg, Chengrui Wang et al.

We explore how physical scale and population size shape the emergence of complex behaviors in open-ended ecological environments. In our setting, agents are unsupervised and have no explicit rewards or learning objectives but instead evolve over time according to reproduction, mutation, and natural selection. As they act, agents also shape their environment and the population around them in an ongoing dynamic ecology. Our goal is not to optimize a single high-performance policy, but instead to examine how behaviors emerge and evolve across large populations due to natural competition and environmental pressures. In an effort to discover how complex behaviors naturally emerge, we conduct experiments in large-scale worlds that reach populations of more than 60,000 individual agents, each with their own evolved neural network policy. We identify various emergent behaviors such as long-range resource extraction, vision-based foraging, and predation that arise under competitive and survival pressures. We examine how sensing modalities and environmental scale affect the emergence of these behaviors, finding that some appear only in sufficiently large environments and populations, with larger scales increasing behavioral stability and consistency. While there is a rich history of research in evolutionary settings, our scaling results provide promising new directions to explore ecology as an instrument of machine learning in an era of abundant computational resources. Experimental code is available at https://github.com/jbejjani2022/ecological-emergent-behavior.

ROSep 28, 2020Code
Amodal 3D Reconstruction for Robotic Manipulation via Stability and Connectivity

William Agnew, Christopher Xie, Aaron Walsman et al.

Learning-based 3D object reconstruction enables single- or few-shot estimation of 3D object models. For robotics, this holds the potential to allow model-based methods to rapidly adapt to novel objects and scenes. Existing 3D reconstruction techniques optimize for visual reconstruction fidelity, typically measured by chamfer distance or voxel IOU. We find that when applied to realistic, cluttered robotics environments, these systems produce reconstructions with low physical realism, resulting in poor task performance when used for model-based control. We propose ARM, an amodal 3D reconstruction system that introduces (1) a stability prior over object shapes, (2) a connectivity prior, and (3) a multi-channel input representation that allows for reasoning over relationships between groups of objects. By using these priors over the physical properties of objects, our system improves reconstruction quality not just by standard visual metrics, but also performance of model-based control on a variety of robotics manipulation tasks in challenging, cluttered environments. Code is available at github.com/wagnew3/ARM.

ROMay 19, 2024
URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images

Zoey Chen, Aaron Walsman, Marius Memmel et al.

Constructing simulation scenes that are both visually and physically realistic is a problem of practical interest in domains ranging from robotics to computer vision. This problem has become even more relevant as researchers wielding large data-hungry learning methods seek new sources of training data for physical decision-making systems. However, building simulation models is often still done by hand. A graphic designer and a simulation engineer work with predefined assets to construct rich scenes with realistic dynamic and kinematic properties. While this may scale to small numbers of scenes, to achieve the generalization properties that are required for data-driven robotic control, we require a pipeline that is able to synthesize large numbers of realistic scenes, complete with 'natural' kinematic and dynamic structures. To attack this problem, we develop models for inferring structure and generating simulation scenes from natural images, allowing for scalable scene generation from web-scale datasets. To train these image-to-simulation models, we show how controllable text-to-image generative models can be used in generating paired training data that allows for modeling of the inverse problem, mapping from realistic images back to complete scene models. We show how this paradigm allows us to build large datasets of scenes in simulation with semantic and physical realism. We present an integrated end-to-end pipeline that generates simulation scenes complete with articulated kinematic and dynamic structures from real-world images and use these for training robotic control policies. We then robustly deploy in the real world for tasks like articulated object manipulation. In doing so, our work provides both a pipeline for large-scale generation of simulation environments and an integrated system for training robust robotic control policies in the resulting environments.

NENov 11, 2025
Understanding Electro-communication and Electro-sensing in Weakly Electric Fish using Multi-Agent Deep Reinforcement Learning

Satpreet H. Singh, Sonja Johnson-Yu, Zhouyang Lu et al.

Weakly electric fish, like Gnathonemus petersii, use a remarkable electrical modality for active sensing and communication, but studying their rich electrosensing and electrocommunication behavior and associated neural activity in naturalistic settings remains experimentally challenging. Here, we present a novel biologically-inspired computational framework to study these behaviors, where recurrent neural network (RNN) based artificial agents trained via multi-agent reinforcement learning (MARL) learn to modulate their electric organ discharges (EODs) and movement patterns to collectively forage in virtual environments. Trained agents demonstrate several emergent features consistent with real fish collectives, including heavy tailed EOD interval distributions, environmental context dependent shifts in EOD interval distributions, and social interaction patterns like freeloading, where agents reduce their EOD rates while benefiting from neighboring agents' active sensing. A minimal two-fish assay further isolates the role of electro-communication, showing that access to conspecific EODs and relative dominance jointly shape foraging success. Notably, these behaviors emerge through evolution-inspired rewards for individual fitness and emergent inter-agent interactions, rather than through rewarding agents explicitly for social interactions. Our work has broad implications for the neuroethology of weakly electric fish, as well as other social, communicating animals in which extensive recordings from multiple individuals, and thus traditional data-driven modeling, are infeasible.

AIFeb 26, 2025
Data-Efficient Multi-Agent Spatial Planning with LLMs

Huangyuan Su, Aaron Walsman, Daniel Garces et al.

In this project, our goal is to determine how to leverage the world-knowledge of pretrained large language models for efficient and robust learning in multiagent decision making. We examine this in a taxi routing and assignment problem where agents must decide how to best pick up passengers in order to minimize overall waiting time. While this problem is situated on a graphical road network, we show that with the proper prompting zero-shot performance is quite strong on this task. Furthermore, with limited fine-tuning along with the one-at-a-time rollout algorithm for look ahead, LLMs can out-compete existing approaches with 50 times fewer environmental interactions. We also explore the benefits of various linguistic prompting approaches and show that including certain easy-to-compute information in the prompt significantly improves performance. Finally, we highlight the LLM's built-in semantic understanding, showing its ability to adapt to environmental factors through simple prompts.

CVJul 6, 2020
FLUID: A Unified Evaluation Framework for Flexible Sequential Data

Matthew Wallingford, Aditya Kusupati, Keivan Alizadeh-Vahid et al.

Modern ML methods excel when training data is IID, large-scale, and well labeled. Learning in less ideal conditions remains an open challenge. The sub-fields of few-shot, continual, transfer, and representation learning have made substantial strides in learning under adverse conditions; each affording distinct advantages through methods and insights. These methods address different challenges such as data arriving sequentially or scarce training examples, however often the difficult conditions an ML system will face over its lifetime cannot be anticipated prior to deployment. Therefore, general ML systems which can handle the many challenges of learning in practical settings are needed. To foster research towards the goal of general ML methods, we introduce a new unified evaluation framework - FLUID (Flexible Sequential Data). FLUID integrates the objectives of few-shot, continual, transfer, and representation learning while enabling comparison and integration of techniques across these subfields. In FLUID, a learner faces a stream of data and must make sequential predictions while choosing how to update itself, adapt quickly to novel classes, and deal with changing data distributions; while accounting for the total amount of compute. We conduct experiments on a broad set of methods which shed new insight on the advantages and limitations of current solutions and indicate new research problems to solve. As a starting point towards more general methods, we present two new baselines which outperform other evaluated methods on FLUID. Project page: https://raivn.cs.washington.edu/projects/FLUID/.

CVAug 5, 2019
Part Segmentation for Highly Accurate Deformable Tracking in Occlusions via Fully Convolutional Neural Networks

Weilin Wan, Aaron Walsman, Dieter Fox

Successfully tracking the human body is an important perceptual challenge for robots that must work around people. Existing methods fall into two broad categories: geometric tracking and direct pose estimation using machine learning. While recent work has shown direct estimation techniques can be quite powerful, geometric tracking methods using point clouds can provide a very high level of 3D accuracy which is necessary for many robotic applications. However these approaches can have difficulty in clutter when large portions of the subject are occluded. To overcome this limitation, we propose a solution based on fully convolutional neural networks (FCN). We develop an optimized Fast-FCN network architecture for our application which allows us to filter observed point clouds and improve tracking accuracy while maintaining interactive frame rates. We also show that this model can be trained with a limited number of examples and almost no manual labelling by using an existing geometric tracker and data augmentation to automatically generate segmentation maps. We demonstrate the accuracy of our full system by comparing it against an existing geometric tracker, and show significant improvement in these challenging scenarios.

CVNov 21, 2018
Early Fusion for Goal Directed Robotic Vision

Aaron Walsman, Yonatan Bisk, Saadia Gabriel et al.

Building perceptual systems for robotics which perform well under tight computational budgets requires novel architectures which rethink the traditional computer vision pipeline. Modern vision architectures require the agent to build a summary representation of the entire scene, even if most of the input is irrelevant to the agent's current goal. In this work, we flip this paradigm, by introducing EarlyFusion vision models that condition on a goal to build custom representations for downstream tasks. We show that these goal specific representations can be learned more quickly, are substantially more parameter efficient, and more robust than existing attention mechanisms in our domain. We demonstrate the effectiveness of these methods on a simulated robotic item retrieval problem that is trained in a fully end-to-end manner via imitation learning.

AIJan 23, 2018
CHALET: Cornell House Agent Learning Environment

Claudia Yan, Dipendra Misra, Andrew Bennnett et al.

We present CHALET, a 3D house simulator with support for navigation and manipulation. CHALET includes 58 rooms and 10 house configuration, and allows to easily create new house and room layouts. CHALET supports a range of common household activities, including moving objects, toggling appliances, and placing objects inside closeable containers. The environment and actions available are designed to create a challenging domain to train and evaluate autonomous agents, including for tasks that combine language, vision, and planning in a dynamic environment.

CVNov 21, 2017
Dynamic High Resolution Deformable Articulated Tracking

Aaron Walsman, Weilin Wan, Tanner Schmidt et al.

The last several years have seen significant progress in using depth cameras for tracking articulated objects such as human bodies, hands, and robotic manipulators. Most approaches focus on tracking skeletal parameters of a fixed shape model, which makes them insufficient for applications that require accurate estimates of deformable object surfaces. To overcome this limitation, we present a 3D model-based tracking system for articulated deformable objects. Our system is able to track human body pose and high resolution surface contours in real time using a commodity depth sensor and GPU hardware. We implement this as a joint optimization over a skeleton to account for changes in pose, and over the vertices of a high resolution mesh to track the subject's shape. Through experimental results we show that we are able to capture dynamic sub-centimeter surface detail such as folds and wrinkles in clothing. We also show that this shape estimation aids kinematic pose estimation by providing a more accurate target to match against the point cloud. The end result is highly accurate spatiotemporal and semantic information which is well suited for physical human robot interaction as well as virtual and augmented reality systems.

ROFeb 10, 2015
Benchmarking in Manipulation Research: The YCB Object and Model Set and Benchmarking Protocols

Berk Calli, Aaron Walsman, Arjun Singh et al.

In this paper we present the Yale-CMU-Berkeley (YCB) Object and Model set, intended to be used to facilitate benchmarking in robotic manipulation, prosthetic design and rehabilitation research. The objects in the set are designed to cover a wide range of aspects of the manipulation problem; it includes objects of daily life with different shapes, sizes, textures, weight and rigidity, as well as some widely used manipulation tests. The associated database provides high-resolution RGBD scans, physical properties, and geometric models of the objects for easy incorporation into manipulation and planning software platforms. In addition to describing the objects and models in the set along with how they were chosen and derived, we provide a framework and a number of example task protocols, laying out how the set can be used to quantitatively evaluate a range of manipulation approaches including planning, learning, mechanical design, control, and many others. A comprehensive literature survey on existing benchmarks and object datasets is also presented and their scope and limitations are discussed. The set will be freely distributed to research groups worldwide at a series of tutorials at robotics conferences, and will be otherwise available at a reasonable purchase cost. It is our hope that the ready availability of this set along with the ground laid in terms of protocol templates will enable the community of manipulation researchers to more easily compare approaches as well as continually evolve benchmarking tests as the field matures.