ROJun 29, 2023
Principles and Guidelines for Evaluating Social Robot Navigation AlgorithmsAnthony Francis, Claudia Pérez-D'Arpino, Chengshu Li et al. · cmu, mit
A major challenge to deploying robots widely is navigation in human-populated environments, commonly referred to as social robot navigation. While the field of social navigation has advanced tremendously in recent years, the fair evaluation of algorithms that tackle social navigation remains hard because it involves not just robotic agents moving in static environments but also dynamic human agents and their perceptions of the appropriateness of robot behavior. In contrast, clear, repeatable, and accessible benchmarks have accelerated progress in fields like computer vision, natural language processing and traditional robot navigation by enabling researchers to fairly compare algorithms, revealing limitations of existing solutions and illuminating promising new directions. We believe the same approach can benefit social navigation. In this paper, we pave the road towards common, widely accessible, and repeatable benchmarking criteria to evaluate social robot navigation. Our contributions include (a) a definition of a socially navigating robot as one that respects the principles of safety, comfort, legibility, politeness, social competency, agent understanding, proactivity, and responsiveness to context, (b) guidelines for the use of metrics, development of scenarios, benchmarks, datasets, and simulators to evaluate social navigation, and (c) a design of a social navigation metrics framework to make it easier to compare results from different simulators, robots and datasets.
ROSep 19, 2022
Gesture2Path: Imitation Learning for Gesture-aware NavigationCatie Cuan, Edward Lee, Emre Fisher et al.
As robots increasingly enter human-centered environments, they must not only be able to navigate safely around humans, but also adhere to complex social norms. Humans often rely on non-verbal communication through gestures and facial expressions when navigating around other people, especially in densely occupied spaces. Consequently, robots also need to be able to interpret gestures as part of solving social navigation tasks. To this end, we present Gesture2Path, a novel social navigation approach that combines image-based imitation learning with model-predictive control. Gestures are interpreted based on a neural network that operates on streams of images, while we use a state-of-the-art model predictive control algorithm to solve point-to-point navigation tasks. We deploy our method on real robots and showcase the effectiveness of our approach for the four gestures-navigation scenarios: left/right, follow me, and make a circle. Our experiments indicate that our method is able to successfully interpret complex human gestures and to use them as a signal to generate socially compliant trajectories for navigation tasks. We validated our method based on in-situ ratings of participants interacting with the robots.
CVOct 14, 2022
Instance Segmentation with Cross-Modal ConsistencyAlex Zihao Zhu, Vincent Casser, Reza Mahjourian et al.
Segmenting object instances is a key task in machine perception, with safety-critical applications in robotics and autonomous driving. We introduce a novel approach to instance segmentation that jointly leverages measurements from multiple sensor modalities, such as cameras and LiDAR. Our method learns to predict embeddings for each pixel or point that give rise to a dense segmentation of the scene. Specifically, our technique applies contrastive learning to points in the scene both across sensor modalities and the temporal domain. We demonstrate that this formulation encourages the models to learn embeddings that are invariant to viewpoint variations and consistent across sensor modalities. We further demonstrate that the embeddings are stable over time as objects move around the scene. This not only provides stable instance masks, but can also provide valuable signals to downstream tasks, such as object tracking. We evaluate our method on the Cityscapes and KITTI-360 datasets. We further conduct a number of ablation studies, demonstrating benefits when applying additional inputs for the contrastive loss.
CVJul 28, 2024
Perm: A Parametric Representation for Multi-Style 3D Hair ModelingChengan He, Xin Sun, Zhixin Shu et al.
We present Perm, a learned parametric representation of human 3D hair designed to facilitate various hair-related applications. Unlike previous work that jointly models the global hair structure and local curl patterns, we propose to disentangle them using a PCA-based strand representation in the frequency domain, thereby allowing more precise editing and output control. Specifically, we leverage our strand representation to fit and decompose hair geometry textures into low- to high-frequency hair structures, termed guide textures and residual textures, respectively. These decomposed textures are later parameterized with different generative models, emulating common stages in the hair grooming process. We conduct extensive experiments to validate the architecture design of Perm, and finally deploy the trained model as a generic prior to solve task-agnostic problems, further showcasing its flexibility and superiority in tasks such as single-view hair reconstruction, hairstyle editing, and hair-conditioned image generation. More details can be found on our project page: https://cs.yale.edu/homes/che/projects/perm/.
CVDec 21, 2023Code
Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL FinetuningDesai Xie, Jiahao Li, Hao Tan et al.
Multi-view diffusion models, obtained by applying Supervised Finetuning (SFT) to text-to-image diffusion models, have driven recent breakthroughs in text-to-3D research. However, due to the limited size and quality of existing 3D datasets, they still suffer from multi-view inconsistencies and Neural Radiance Field (NeRF) reconstruction artifacts. We argue that multi-view diffusion models can benefit from further Reinforcement Learning Finetuning (RLFT), which allows models to learn from the data generated by themselves and improve beyond their dataset limitations during SFT. To this end, we introduce Carve3D, an improved RLFT algorithm coupled with a novel Multi-view Reconstruction Consistency (MRC) metric, to enhance the consistency of multi-view diffusion models. To measure the MRC metric on a set of multi-view images, we compare them with their corresponding NeRF renderings at the same camera viewpoints. The resulting model, which we denote as Carve3DM, demonstrates superior multi-view consistency and NeRF reconstruction quality than existing models. Our results suggest that pairing SFT with Carve3D's RLFT is essential for developing multi-view-consistent diffusion models, mirroring the standard Large Language Model (LLM) alignment pipeline. Our code, training and testing data, and video results are available at: https://desaixie.github.io/carve-3d.
CVMay 13
SynVA: A Modular Toolkit for Vessel Generation and Aneurysm EditingMarten J. Finck, Niklas C. Koser, Sarker M. Mahfuz et al.
Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.
CVMar 27, 2024
Generating Diverse Agricultural Data for Vision-Based Farming ApplicationsMikolaj Cieslak, Umabharathi Govindarajan, Alejandro Garcia et al.
We present a specialized procedural model for generating synthetic agricultural scenes, focusing on soybean crops, along with various weeds. This model is capable of simulating distinct growth stages of these plants, diverse soil conditions, and randomized field arrangements under varying lighting conditions. The integration of real-world textures and environmental factors into the procedural generation process enhances the photorealism and applicability of the synthetic data. Our dataset includes 12,000 images with semantic labels, offering a comprehensive resource for computer vision tasks in precision agriculture, such as semantic segmentation for autonomous weed control. We validate our model's effectiveness by comparing the synthetic data against real agricultural images, demonstrating its potential to significantly augment training data for machine learning models in agriculture. This approach not only provides a cost-effective solution for generating high-quality, diverse data but also addresses specific needs in agricultural vision tasks that are not fully covered by general-purpose models.
CVMar 31, 2024
LAESI: Leaf Area Estimation with Synthetic ImageryJacek Kałużny, Yannik Schreckenberg, Karol Cyganik et al.
We introduce LAESI, a Synthetic Leaf Dataset of 100,000 synthetic leaf images on millimeter paper, each with semantic masks and surface area labels. This dataset provides a resource for leaf morphology analysis primarily aimed at beech and oak leaves. We evaluate the applicability of the dataset by training machine learning models for leaf surface area prediction and semantic segmentation, using real images for validation. Our validation shows that these models can be trained to predict leaf surface area with a relative error not greater than an average human annotator. LAESI also provides an efficient framework based on 3D procedural models and generative AI for the large-scale, controllable generation of data with potential further applications in agriculture and biology. We evaluate the inclusion of generative AI in our procedural data generation pipeline and show how data filtering based on annotation consistency results in datasets which allow training the highest performing vision models.
GRApr 25, 2024
One Noise to Rule Them All: Learning a Unified Model of Spatially-Varying Noise PatternsArman Maesumi, Dylan Hu, Krishi Saripalli et al.
Procedural noise is a fundamental component of computer graphics pipelines, offering a flexible way to generate textures that exhibit "natural" random variation. Many different types of noise exist, each produced by a separate algorithm. In this paper, we present a single generative model which can learn to generate multiple types of noise as well as blend between them. In addition, it is capable of producing spatially-varying noise blends despite not having access to such data for training. These features are enabled by training a denoising diffusion model using a novel combination of data augmentation and network conditioning techniques. Like procedural noise generators, the model's behavior is controllable via interpretable parameters and a source of randomness. We use our model to produce a variety of visually compelling noise textures. We also present an application of our model to improving inverse procedural material design; using our model in place of fixed-type noise nodes in a procedural material graph results in higher-fidelity material reconstructions without needing to know the type of noise in advance.
ROFeb 4, 2024
Gazebo Plants: Simulating Plant-Robot Interaction with Cosserat RodsJunchen Deng, Samhita Marri, Jonathan Klein et al.
Robotic harvesting has the potential to positively impact agricultural productivity, reduce costs, improve food quality, enhance sustainability, and to address labor shortage. In the rapidly advancing field of agricultural robotics, the necessity of training robots in a virtual environment has become essential. Generating training data to automatize the underlying computer vision tasks such as image segmentation, object detection and classification, also heavily relies on such virtual environments as synthetic data is often required to overcome the shortage and lack of variety of real data sets. However, physics engines commonly employed within the robotics community, such as ODE, Simbody, Bullet, and DART, primarily support motion and collision interaction of rigid bodies. This inherent limitation hinders experimentation and progress in handling non-rigid objects such as plants and crops. In this contribution, we present a plugin for the Gazebo simulation platform based on Cosserat rods to model plant motion. It enables the simulation of plants and their interaction with the environment. We demonstrate that, using our plugin, users can conduct harvesting simulations in Gazebo by simulating a robotic arm picking fruits and achieve results comparable to real-world experiments.
CRSep 11, 2025
Prompt Pirates Need a Map: Stealing Seeds helps Stealing PromptsFelix Mächtle, Ashwath Shetty, Jonas Sander et al.
Diffusion models have significantly advanced text-to-image generation, enabling the creation of highly realistic images conditioned on textual prompts and seeds. Given the considerable intellectual and economic value embedded in such prompts, prompt theft poses a critical security and privacy concern. In this paper, we investigate prompt-stealing attacks targeting diffusion models. We reveal that numerical optimization-based prompt recovery methods are fundamentally limited as they do not account for the initial random noise used during image generation. We identify and exploit a noise-generation vulnerability (CWE-339), prevalent in major image-generation frameworks, originating from PyTorch's restriction of seed values to a range of $2^{32}$ when generating the initial random noise on CPUs. Through a large-scale empirical analysis conducted on images shared via the popular platform CivitAI, we demonstrate that approximately 95% of these images' seed values can be effectively brute-forced in 140 minutes per seed using our seed-recovery tool, SeedSnitch. Leveraging the recovered seed, we propose PromptPirate, a genetic algorithm-based optimization method explicitly designed for prompt stealing. PromptPirate surpasses state-of-the-art methods, i.e., PromptStealer, P2HP, and CLIP-Interrogator, achieving an 8-11% improvement in LPIPS similarity. Furthermore, we introduce straightforward and effective countermeasures that render seed stealing, and thus optimization-based prompt stealing, ineffective. We have disclosed our findings responsibly and initiated coordinated mitigation efforts with the developers to address this critical vulnerability.
CVJun 13, 2024
LRM-Zero: Training Large Reconstruction Models with Synthesized DataDesai Xie, Sai Bi, Zhixin Shu et al.
We present LRM-Zero, a Large Reconstruction Model (LRM) trained entirely on synthesized 3D data, achieving high-quality sparse-view 3D reconstruction. The core of LRM-Zero is our procedural 3D dataset, Zeroverse, which is automatically synthesized from simple primitive shapes with random texturing and augmentations (e.g., height fields, boolean differences, and wireframes). Unlike previous 3D datasets (e.g., Objaverse) which are often captured or crafted by humans to approximate real 3D data, Zeroverse completely ignores realistic global semantics but is rich in complex geometric and texture details that are locally similar to or even more intricate than real objects. We demonstrate that our LRM-Zero, trained with our fully synthesized Zeroverse, can achieve high visual quality in the reconstruction of real-world objects, competitive with models trained on Objaverse. We also analyze several critical design choices of Zeroverse that contribute to LRM-Zero's capability and training stability. Our work demonstrates that 3D reconstruction, one of the core tasks in 3D vision, can potentially be addressed without the semantics of real-world objects. The Zeroverse's procedural synthesis code and interactive visualization are available at: https://desaixie.github.io/lrm-zero/.
LGFeb 5, 2024
A Lennard-Jones Layer for Distribution NormalizationMulun Na, Jonathan Klein, Biao Zhang et al.
We introduce the Lennard-Jones layer (LJL) for the equalization of the density of 2D and 3D point clouds through systematically rearranging points without destroying their overall structure (distribution normalization). LJL simulates a dissipative process of repulsive and weakly attractive interactions between individual points by considering the nearest neighbor of each point at a given moment in time. This pushes the particles into a potential valley, reaching a well-defined stable configuration that approximates an equidistant sampling after the stabilization process. We apply LJLs to redistribute randomly generated point clouds into a randomized uniform distribution. Moreover, LJLs are embedded in the generation process of point cloud networks by adding them at later stages of the inference process. The improvements in 3D point cloud generation utilizing LJLs are evaluated qualitatively and quantitatively. Finally, we apply LJLs to improve the point distribution of a score-based 3D point cloud denoising network. In general, we demonstrate that LJLs are effective for distribution normalization which can be applied at negligible cost without retraining the given neural network.
LGMay 9, 2023
DeepTree: Modeling Trees with Situated LatentsXiaochen Zhou, Bosheng Li, Bedrich Benes et al.
In this paper, we propose DeepTree, a novel method for modeling trees based on learning developmental rules for branching structures instead of manually defining them. We call our deep neural model situated latent because its behavior is determined by the intrinsic state -- encoded as a latent space of a deep neural model -- and by the extrinsic (environmental) data that is situated as the location in the 3D space and on the tree structure. We use a neural network pipeline to train a situated latent space that allows us to locally predict branch growth only based on a single node in the branch graph of a tree model. We use this representation to progressively develop new branch nodes, thereby mimicking the growth process of trees. Starting from a root node, a tree is generated by iteratively querying the neural network on the newly added nodes resulting in the branching structure of the whole tree. Our method enables generating a wide variety of tree shapes without the need to define intricate parameters that control their growth and behavior. Furthermore, we show that the situated latents can also be used to encode the environmental response of tree models, e.g., when trees grow next to obstacles. We validate the effectiveness of our method by measuring the similarity of our tree models and by procedurally generated ones based on a number of established metrics for tree form.
GRAug 12, 2020
Procedural Urban ForestryTill Niese, Sören Pirk, Matthias Albrecht et al.
The placement of vegetation plays a central role in the realism of virtual scenes. We introduce procedural placement models (PPMs) for vegetation in urban layouts. PPMs are environmentally sensitive to city geometry and allow identifying plausible plant positions based on structural and functional zones in an urban layout. PPMs can either be directly used by defining their parameters or can be learned from satellite images and land register data. Together with approaches for generating buildings and trees, this allows us to populate urban landscapes with complex 3D vegetation. The effectiveness of our framework is shown through examples of large-scale city scenes and close-ups of individually grown tree models; we also validate it by a perceptual user study.
CVJun 16, 2020
Domain Adaptation with Morphologic SegmentationJonathan Klein, Sören Pirk, Dominik L. Michels
We present a novel domain adaptation framework that uses morphologic segmentation to translate images from arbitrary input domains (real and synthetic) into a uniform output domain. Our framework is based on an established image-to-image translation pipeline that allows us to first transform the input image into a generalized representation that encodes morphology and semantics - the edge-plus-segmentation map (EPS) - which is then transformed into an output domain. Images transformed into the output domain are photo-realistic and free of artifacts that are commonly present across different real (e.g. lens flare, motion blur, etc.) and synthetic (e.g. unrealistic textures, simplified geometry, etc.) data sets. Our goal is to establish a preprocessing step that unifies data from multiple sources into a common representation that facilitates training downstream tasks in computer vision. This way, neural networks for existing tasks can be trained on a larger variety of training data, while they are also less affected by overfitting to specific data sets. We showcase the effectiveness of our approach by qualitatively and quantitatively evaluating our method on four data sets of simulated and real data of urban scenes. Additional results can be found on the project website available at http://jonathank.de/research/eps/ .
ROJun 8, 2020
Modeling Long-horizon Tasks as Sequential Interaction LandscapesSören Pirk, Karol Hausman, Alexander Toshev et al.
Complex object manipulation tasks often span over long sequences of operations. Task planning over long-time horizons is a challenging and open problem in robotics, and its complexity grows exponentially with an increasing number of subtasks. In this paper we present a deep learning network that learns dependencies and transitions across subtasks solely from a set of demonstration videos. We represent each subtask as an action symbol (e.g. move cup), and show that these symbols can be learned and predicted directly from image observations. Learning from demonstrations and visual observations are two main pillars of our approach. The former makes the learning tractable as it provides the network with information about the most frequent transitions and relevant dependency between subtasks (instead of exploring all possible combination), while the latter allows the network to continuously monitor the task progress and thus to interactively adapt to changes in the environment. We evaluate our framework on two long horizon tasks: (1) block stacking of puzzle pieces being executed by humans, and (2) a robot manipulation task involving pick and place of objects and sliding a cabinet door with a 7-DoF robot arm. We show that complex plans can be carried out when executing the robotic task and the robot can interactively adapt to changes in the environment and recover from failure cases.
COMP-PHJun 6, 2020
Accurately Solving Physical Systems with Graph LearningHan Shao, Tassilo Kugelstadt, Torsten Hädrich et al.
Iterative solvers are widely used to accurately simulate physical systems. These solvers require initial guesses to generate a sequence of improving approximate solutions. In this contribution, we introduce a novel method to accelerate iterative solvers for physical systems with graph networks (GNs) by predicting the initial guesses to reduce the number of iterations. Unlike existing methods that aim to learn physical systems in an end-to-end manner, our approach guarantees long-term stability and therefore leads to more accurate solutions. Furthermore, our method improves the run time performance of traditional iterative solvers. To explore our method we make use of position-based dynamics (PBD) as a common solver for physical systems and evaluate it by simulating the dynamics of elastic rods. Our approach is able to generalize across different initial conditions, discretizations, and realistic material properties. Finally, we demonstrate that our method also performs well when taking discontinuous effects into account such as collisions between individual rods. Finally, to illustrate the scalability of our approach, we simulate complex 3D tree models composed of over a thousand individual branch segments swaying in wind fields. A video showing dynamic results of our graph learning assisted simulations of elastic rods can be found on the project website available at http://computationalsciences.org/publications/shao-2021-physical-systems-graph-learning.html .
CVMay 14, 2020
Taskology: Utilizing Task Relations at ScaleYao Lu, Sören Pirk, Jan Dlabal et al.
Many computer vision tasks address the problem of scene understanding and are naturally interrelated e.g. object classification, detection, scene segmentation, depth estimation, etc. We show that we can leverage the inherent relationships among collections of tasks, as they are trained jointly, supervising each other through their known relationships via consistency losses. Furthermore, explicitly utilizing the relationships between tasks allows improving their performance while dramatically reducing the need for labeled data, and allows training with additional unsupervised or simulated data. We demonstrate a distributed joint training algorithm with task-level parallelism, which affords a high degree of asynchronicity and robustness. This allows learning across multiple tasks, or with large amounts of input data, at scale. We demonstrate our framework on subsets of the following collection of tasks: depth and normal prediction, semantic segmentation, 3D motion and ego-motion estimation, and object tracking and 3D detection in point clouds. We observe improved performance across these tasks, especially in the low-label regime.
ROJun 21, 2019
Data-Efficient Learning for Sim-to-Real Robotic Grasping using Deep Point Cloud Prediction NetworksXinchen Yan, Mohi Khansari, Jasmine Hsu et al.
Training a deep network policy for robot manipulation is notoriously costly and time consuming as it depends on collecting a significant amount of real world data. To work well in the real world, the policy needs to see many instances of the task, including various object arrangements in the scene as well as variations in object geometry, texture, material, and environmental illumination. In this paper, we propose a method that learns to perform table-top instance grasping of a wide variety of objects while using no real world grasping data, outperforming the baseline using 2.5D shape by 10%. Our method learns 3D point cloud of object, and use that to train a domain-invariant grasping policy. We formulate the learning process as a two-step procedure: 1) Learning a domain-invariant 3D shape representation of objects from about 76K episodes in simulation and about 530 episodes in the real world, where each episode lasts less than a minute and 2) Learning a critic grasping policy in simulation only based on the 3D shape representation from step 1. Our real world data collection in step 1 is both cheaper and faster compared to existing approaches as it only requires taking multiple snapshots of the scene using a RGBD camera. Finally, the learned 3D representation is not specific to grasping, and can potentially be used in other interaction tasks.
CVJun 10, 2019
Online Object Representations with Contrastive LearningSören Pirk, Mohi Khansari, Yunfei Bai et al.
We propose a self-supervised approach for learning representations of objects from monocular videos and demonstrate it is particularly useful in situated settings such as robotics. The main contributions of this paper are: 1) a self-supervising objective trained with contrastive learning that can discover and disentangle object attributes from video without using any labels; 2) we leverage object self-supervision for online adaptation: the longer our online model looks at objects in a video, the lower the object identification error, while the offline baseline remains with a large fixed error; 3) to explore the possibilities of a system entirely free of human supervision, we let a robot collect its own data, train on this data with our self-supervise scheme, and then show the robot can point to objects similar to the one presented in front of it, demonstrating generalization of object attributes. An interesting and perhaps surprising finding of this approach is that given a limited set of objects, object correspondences will naturally emerge when using contrastive learning without requiring explicit positive pairs. Videos illustrating online object adaptation and robotic pointing are available at: https://online-objects.github.io/.
GRSep 27, 2016
Understanding and Exploiting Object Interaction LandscapesSören Pirk, Vojtech Krs, Kaimo Hu et al.
Interactions play a key role in understanding objects and scenes, for both virtual and real world agents. We introduce a new general representation for proximal interactions among physical objects that is agnostic to the type of objects or interaction involved. The representation is based on tracking particles on one of the participating objects and then observing them with sensors appropriately placed in the interaction volume or on the interaction surfaces. We show how to factorize these interaction descriptors and project them into a particular participating object so as to obtain a new functional descriptor for that object, its interaction landscape, capturing its observed use in a spatio-temporal framework. Interaction landscapes are independent of the particular interaction and capture subtle dynamic effects in how objects move and behave when in functional use. Our method relates objects based on their function, establishes correspondences between shapes based on functional key points and regions, and retrieves peer and partner objects with respect to an interaction.