Ishika Singh

RO
h-index21
11papers
2,229citations
Novelty51%
AI Score53

11 Papers

ROSep 22, 2022
ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

Ishika Singh, Valts Blukis, Arsalan Mousavian et al. · gatech, nvidia

Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even generate action sequences directly, given an instruction in natural language with no additional domain information. However, such methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions not possible on a given robot in its current context. We present a programmatic LLM prompt structure that enables plan generation functional across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with program-like specifications of the available actions and objects in an environment, as well as with example programs that can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation experiments, demonstrate state of the art success rates in VirtualHome household tasks, and deploy our method on a physical robot arm for tabletop tasks. Website at progprompt.github.io

ROMay 26
Colosseum V2: Benchmarking Generalization for Vision Language Action Models

Jeremy Morgan, Prajwal Vijay, Hyeonho Oh et al.

Vision-Language-Action (VLA) models demonstrate promising generalization in robotic manipulation, driven by advances in large-scale vision and language pre-training. This progress can be misleading. Despite the zero-shot perception and language capabilities of VLAs, their overall task performance often degrades under distribution shifts, revealing gaps in how these systems translate high-level understanding into robust behavior. To systematically study this gap, we introduce Colosseum V2, a large-scale simulation benchmark for evaluating VLA generalization in robot learning across diverse conditions. The benchmark comprises 28 tasks spanning 13 task categories and two robot morphologies, covering a wide range of manipulation primitives and long-horizon behaviors. Built on the ManiSkill simulator, Colosseum V2 enables fast, GPU-parallelized evaluation and supports both in-domain and out-of-domain testing at scale. We evaluate state-of-the-art methods, including Action Chunking Transformers (ACT) and Pi0.5, and reveal limitations in both base performance and generalization. We demonstrate strong correlations between simulation and real-world metrics that support the ecological validity of the benchmark. By standardizing tasks, metrics, and evaluation protocols within a unified benchmark, Colosseum V2 enables reproducible and fair comparisons, reduced evaluation overhead, and accelerated progress toward general-purpose robot policies.

ROFeb 13, 2024Code
THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

Wilbert Pumacay, Ishika Singh, Jiafei Duan et al. · uw

To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup. We present THE COLOSSEUM, a novel simulation benchmark, with 20 diverse manipulation tasks, that enables systematical evaluation of models across 14 axes of environmental perturbations. These perturbations include changes in color, texture, and size of objects, table-tops, and backgrounds; we also vary lighting, distractors, physical properties perturbations and camera pose. Using THE COLOSSEUM, we compare 5 state-of-the-art manipulation models to reveal that their success rate degrades between 30-50% across these perturbation factors. When multiple perturbations are applied in unison, the success rate degrades $\geq$75%. We identify that changing the number of distractor objects, target object color, or lighting conditions are the perturbations that reduce model performance the most. To verify the ecological validity of our results, we show that our results in simulation are correlated ($\bar{R}^2 = 0.614$) to similar perturbations in real-world experiments. We open source code for others to use THE COLOSSEUM, and also release code to 3D print the objects used to replicate the real-world perturbations. Ultimately, we hope that THE COLOSSEUM will serve as a benchmark to identify modeling decisions that systematically improve generalization for manipulation. See https://robot-colosseum.github.io/ for more details.

CLNov 28, 2023
Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?

Wang Zhu, Ishika Singh, Yuan Huang et al.

Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy. But: does that noise matter? We find that nonsensical or irrelevant language instructions during pretraining can have little effect on downstream performance for both HAMT and VLN-BERT on R2R, and is still better than only using clean, human data. To underscore these results, we concoct an efficient augmentation method, Unigram + Object, which generates nonsensical instructions that nonetheless improve downstream performance. Our findings suggest that what matters for VLN R2R pretraining is the quantity of visual trajectories, not the quality of instructions.

LGJun 16, 2020Code
Differentially-private Federated Neural Architecture Search

Ishika Singh, Haoyi Zhou, Kunlin Yang et al.

Neural architecture search, which aims to automatically search for architectures (e.g., convolution, max pooling) of neural networks that maximize validation performance, has achieved remarkable progress recently. In many application scenarios, several parties would like to collaboratively search for a shared neural architecture by leveraging data from all parties. However, due to privacy concerns, no party wants its data to be seen by other parties. To address this problem, we propose federated neural architecture search (FNAS), where different parties collectively search for a differentiable architecture by exchanging gradients of architecture variables without exposing their data to other parties. To further preserve privacy, we study differentially-private FNAS (DP-FNAS), which adds random noise to the gradients of architecture variables. We provide theoretical guarantees of DP-FNAS in achieving differential privacy. Experiments show that DP-FNAS can search highly-performant neural architectures while protecting the privacy of individual parties. The code is available at https://github.com/UCSD-AI4H/DP-FNAS

AIMar 25, 2024
TwoStep: Multi-agent Task Planning using Classical Planners and Large Language Models

David Bai, Ishika Singh, David Traum et al.

Classical planning formulations like the Planning Domain Definition Language (PDDL) admit action sequences guaranteed to achieve a goal state given an initial state if any are possible. However, reasoning problems defined in PDDL do not capture temporal aspects of action taking, such as concurrent actions between two agents when there are no conflicting conditions, without significant modification and definition to existing PDDL domains. A human expert aware of such constraints can decompose a goal into subgoals, each reachable through single agent planning, to take advantage of simultaneous actions. In contrast to classical planning, large language models (LLMs) directly used for inferring plan steps rarely guarantee execution success, but are capable of leveraging commonsense reasoning to assemble action sequences. We combine the strengths of both classical planning and LLMs by approximating human intuitions for multi-agent planning goal decomposition. We demonstrate that LLM-based goal decomposition leads to faster planning times than solving multi-agent PDDL problems directly while simultaneously achieving fewer plan execution steps than a single agent plan alone, as well as most multiagent plans, while guaranteeing execution success. Additionally, we find that LLM-based approximations of subgoals result in similar multi-agent execution lengths to those specified by human experts. Website and resources at https://glamor-usc.github.io/twostep

ROJun 1, 2025
OG-VLA: Orthographic Image Generation for 3D-Aware Vision-Language Action Model

Ishika Singh, Ankit Goyal, Stan Birchfield et al. · nvidia

We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. We address the challenge of mapping natural language instructions and one or more RGBD observations to quasi-static robot actions. 3D-aware robot policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. On the other hand, VLAs excel at generalizing across instructions and scenes, but can be sensitive to camera and robot pose variations. We leverage prior knowledge embedded in language and vision foundation models to improve generalization of 3D-aware keyframe policies. OG-VLA unprojects input observations from diverse views into a point cloud which is then rendered from canonical orthographic views, ensuring input view invariance and consistency between input and output spaces. These canonical views are processed with a vision backbone, a Large Language Model (LLM), and an image diffusion model to generate images that encode the next position and orientation of the end-effector on the input scene. Evaluations on the Arnold and Colosseum benchmarks demonstrate state-of-the-art generalization to unseen environments, with over 40% relative improvements while maintaining robust performance in seen settings. We also show real-world adaption in 3 to 5 demonstrations along with strong generalization. Videos and resources at https://og-vla.github.io/

ROJun 25, 2025
PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models

Wang Bill Zhu, Miaosen Chai, Ishika Singh et al.

We propose PSALM-V, the first autonomous neuro-symbolic learning system able to induce symbolic action semantics (i.e., pre- and post-conditions) in visual environments through interaction. PSALM-V bootstraps reliable symbolic planning without expert action definitions, using LLMs to generate heuristic plans and candidate symbolic semantics. Previous work has explored using large language models to generate action semantics for Planning Domain Definition Language (PDDL)-based symbolic planners. However, these approaches have primarily focused on text-based domains or relied on unrealistic assumptions, such as access to a predefined problem file, full observability, or explicit error messages. By contrast, PSALM-V dynamically infers PDDL problem files and domain action semantics by analyzing execution outcomes and synthesizing possible error explanations. The system iteratively generates and executes plans while maintaining a tree-structured belief over possible action semantics for each action, iteratively refining these beliefs until a goal state is reached. Simulated experiments of task completion in ALFRED demonstrate that PSALM-V increases the plan success rate from 37% (Claude-3.7) to 74% in partially observed setups. Results on two 2D game environments, RTFM and Overcooked-AI, show that PSALM-V improves step efficiency and succeeds in domain induction in multi-agent settings. PSALM-V correctly induces PDDL pre- and post-conditions for real-world robot BlocksWorld tasks, despite low-level manipulation failures from the robot.

AIJun 4, 2024
Language Models can Infer Action Semantics for Symbolic Planners from Environment Feedback

Wang Zhu, Ishika Singh, Robin Jia et al.

Symbolic planners can discover a sequence of actions from initial to goal states given expert-defined, domain-specific logical action semantics. Large Language Models (LLMs) can directly generate such sequences, but limitations in reasoning and state-tracking often result in plans that are insufficient or unexecutable. We propose Predicting Semantics of Actions with Language Models (PSALM), which automatically learns action semantics by leveraging the strengths of both symbolic planners and LLMs. PSALM repeatedly proposes and executes plans, using the LLM to partially generate plans and to infer domain-specific action semantics based on execution outcomes. PSALM maintains a belief over possible action semantics that is iteratively updated until a goal state is reached. Experiments on 7 environments show that when learning just from one goal, PSALM boosts plan success rate from 36.4% (on Claude-3.5) to 100%, and explores the environment more efficiently than prior work to infer ground truth domain action semantics.

CLJul 18, 2021
Pre-trained Language Models as Prior Knowledge for Playing Text-based Games

Ishika Singh, Gargi Singh, Ashutosh Modi

Recently, text world games have been proposed to enable artificial agents to understand and reason about real-world scenarios. These text-based games are challenging for artificial agents, as it requires an understanding of and interaction using natural language in a partially observable environment. Agents observe the environment via textual descriptions designed to be challenging enough for even human players. Past approaches have not paid enough attention to the language understanding capability of the proposed agents. Typically, these approaches train from scratch, an agent that learns both textual representations and the gameplay online during training using a temporal loss function. Given the sample-inefficiency of RL approaches, it is inefficient to learn rich enough textual representations to be able to understand and reason using the textual observation in such a complicated game environment setting. In this paper, we improve the semantic understanding of the agent by proposing a simple RL with LM framework where we use transformer-based language models with Deep RL models. We perform a detailed study of our framework to demonstrate how our model outperforms all existing agents on the popular game, Zork1, to achieve a score of 44.7, which is 1.6 higher than the state-of-the-art model. Overall, our proposed approach outperforms 4 games out of the 14 text-based games, while performing comparable to the state-of-the-art models on the remaining games.

CLNov 8, 2020
Adapting a Language Model for Controlled Affective Text Generation

Ishika Singh, Ahsan Barkati, Tushar Goswamy et al.

Human use language not just to convey information but also to express their inner feelings and mental states. In this work, we adapt the state-of-the-art language generation models to generate affective (emotional) text. We posit a model capable of generating affect-driven and topic-focused sentences without losing grammatical correctness as the affect intensity increases. We propose to incorporate emotion as prior for the probabilistic state-of-the-art text generation model such as GPT-2. The model gives a user the flexibility to control the category and intensity of emotion as well as the topic of the generated text. Previous attempts at modelling fine-grained emotions fall out on grammatical correctness at extreme intensities, but our model is resilient to this and delivers robust results at all intensities. We conduct automated evaluations and human studies to test the performance of our model and provide a detailed comparison of the results with other models. In all evaluations, our model outperforms existing affective text generation models.