35.0CLApr 28
Limited Linguistic Diversity in Embodied AI DatasetsSelma Wanna, Agnes Luhtaru, Jonathan Salfity et al.
Language plays a critical role in Vision-Language-Action (VLA) models, yet the linguistic characteristics of the datasets used to train and evaluate these systems remain poorly documented. In this work, we present a systematic dataset audit of several widely used VLA corpora, aiming to characterize what kinds of instructions these datasets actually contain and how much linguistic variety they provide. We quantify instruction language along complementary dimensions--including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Our analysis shows that many datasets rely on highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. We position these findings as descriptive documentation of the language signal available in current VLA training and evaluation data, intended to support more detailed dataset reporting, more principled dataset selection, and targeted curation or augmentation strategies that broaden language coverage.
CVJul 19, 2024
The Collection of a Human Robot Collaboration Dataset for Cooperative Assembly in Glovebox EnvironmentsShivansh Sharma, Mathew Huang, Sanat Nair et al.
Industry 4.0 introduced AI as a transformative solution for modernizing manufacturing processes. Its successor, Industry 5.0, envisions humans as collaborators and experts guiding these AI-driven manufacturing solutions. Developing these techniques necessitates algorithms capable of safe, real-time identification of human positions in a scene, particularly their hands, during collaborative assembly. Although substantial efforts have curated datasets for hand segmentation, most focus on residential or commercial domains. Existing datasets targeting industrial settings predominantly rely on synthetic data, which we demonstrate does not effectively transfer to real-world operations. Moreover, these datasets lack uncertainty estimations critical for safe collaboration. Addressing these gaps, we present HAGS: Hand and Glove Segmentation Dataset. This dataset provides challenging examples to build applications toward hand and glove segmentation in industrial human-robot collaboration scenarios as well as assess out-of-distribution images, constructed via green screen augmentations, to determine ML-classifier robustness. We study state-of-the-art, real-time segmentation models to evaluate existing methods. Our dataset and baselines are publicly available.
ROMar 25, 2024
Temporal and Semantic Evaluation Metrics for Foundation Models in Post-Hoc Analysis of Robotic Sub-tasksJonathan Salfity, Selma Wanna, Minkyu Choi et al.
Recent works in Task and Motion Planning (TAMP) show that training control policies on language-supervised robot trajectories with quality labeled data markedly improves agent task success rates. However, the scarcity of such data presents a significant hurdle to extending these methods to general use cases. To address this concern, we present an automated framework to decompose trajectory data into temporally bounded and natural language-based descriptive sub-tasks by leveraging recent prompting strategies for Foundation Models (FMs) including both Large Language Models (LLMs) and Vision Language Models (VLMs). Our framework provides both time-based and language-based descriptions for lower-level sub-tasks that comprise full trajectories. To rigorously evaluate the quality of our automatic labeling framework, we contribute an algorithm SIMILARITY to produce two novel metrics, temporal similarity and semantic similarity. The metrics measure the temporal alignment and semantic fidelity of language descriptions between two sub-task decompositions, namely an FM sub-task decomposition prediction and a ground-truth sub-task decomposition. We present scores for temporal similarity and semantic similarity above 90%, compared to 30% of a randomized baseline, for multiple robotic environments, demonstrating the effectiveness of our proposed framework. Our results enable building diverse, large-scale, language-supervised datasets for improved robotic TAMP.
ROFeb 18, 2022
Smooth time optimal trajectory generation for dronesSrinath Tankasala, Can Pehlivanturk, Efstathios Bakolas et al.
In this paper, we address a minimum-time steering problem for a drone modeled as point mass with bounded acceleration, across a set of desired waypoints in the presence of gravity. We first provide a method to solve for the minimum-time control input that will steer the point mass between two waypoints based on a continuous-time problem formulation which we address by using Pontryagin's Minimum Principle. Subsequently, we solve for the time-optimal trajectory across the given set of waypoints by discretizing in the time domain and formulating the minimum-time problem as a nonlinear program (NLP). The velocities at each waypoint obtained from solving the NLP in the discretized domain are then used as boundary conditions to extend our two-point solution across those multiple waypoints. We apply this planning methodology to execute a surveying task that minimizes the time taken to completely explore a target area or volume. Numerical simulations and theoretical analyses of this new planning methodology are presented. The results from our approach are also compared to traditional polynomial trajectories like minimum snap planning.
ROJul 25, 2019
TuneNet: One-Shot Residual Tuning for System Identification and Sim-to-Real Robot Task TransferAdam Allevato, Elaine Schaertl Short, Mitch Pryor et al.
As researchers teach robots to perform more and more complex tasks, the need for realistic simulation environments is growing. Existing techniques for closing the reality gap by approximating real-world physics often require extensive real world data and/or thousands of simulation samples. This paper presents TuneNet, a new machine learning-based method to directly tune the parameters of one model to match another using an *iterative residual tuning* technique. TuneNet estimates the parameter difference between two models using a single observation from the target and minimal simulation, allowing rapid, accurate and sample-efficient parameter estimation. The system can be trained via supervised learning over an auto-generated simulated dataset. We show that TuneNet can perform system identification, even when the true parameter values lie well outside the distribution seen during training, and demonstrate that simulators tuned with TuneNet outperform existing techniques for predicting rigid body motion. Finally, we show that our method can estimate real-world parameter values, allowing a robot to perform sim-to-real task transfer on a dynamic manipulation task unseen during training. Code and videos are available online at http://bit.ly/2lf1bAw.