Patricio A. Vela

33papers

1,866citations

Novelty48%

AI Score46

Ranked #60,260 of 201,326 authors (top 30%)#1,796 in RO (top 24%)

33 Papers

LGMar 28, 2023

Planning with Sequence Models through Iterative Energy Minimization

Hongyi Chen, Yilun Du, Yiye Chen et al. · mit

Recent works have shown that sequence modeling can be effectively used to train reinforcement learning (RL) policies. However, the success of applying existing sequence models to planning, in which we wish to obtain a trajectory of actions to reach some goal, is less straightforward. The typical autoregressive generation procedures of sequence models preclude sequential refinement of earlier steps, which limits the effectiveness of a predicted plan. In this paper, we suggest an approach towards integrating planning with sequence models based on the idea of iterative energy minimization, and illustrate how such a procedure leads to improved RL performance across different tasks. We train a masked language model to capture an implicit energy function over trajectories of actions, and formulate planning as finding a trajectory of actions with minimum energy. We illustrate how this procedure enables improved performance over recent approaches across BabyAI and Atari environments. We further demonstrate unique benefits of our iterative optimization procedure, involving new task generalization, test-time constraints adaptation, and the ability to compose plans together. Project website: https://hychen-naza.github.io/projects/LEAP

CVOct 18, 2022

Parallel Inversion of Neural Radiance Fields for Robust Pose Estimation

Yunzhi Lin, Thomas Müller, Jonathan Tremblay et al. · gatech

We present a parallelized optimization method based on fast Neural Radiance Fields (NeRF) for estimating 6-DoF pose of a camera with respect to an object or scene. Given a single observed RGB image of the target, we can predict the translation and rotation of the camera by minimizing the residual between pixels rendered from a fast NeRF model and pixels in the observed image. We integrate a momentum-based camera extrinsic optimization procedure into Instant Neural Graphics Primitives, a recent exceptionally fast NeRF implementation. By introducing parallel Monte Carlo sampling into the pose estimation task, our method overcomes local minima and improves efficiency in a more extensive search space. We also show the importance of adopting a more robust pixel-based loss function to reduce error. Experiments demonstrate that our method can achieve improved generalization and robustness on both synthetic and real-world benchmarks.

CVMay 23, 2022

Keypoint-Based Category-Level Object Pose Tracking from an RGB Sequence with Uncertainty Estimation

Yunzhi Lin, Jonathan Tremblay, Stephen Tyree et al. · gatech

We propose a single-stage, category-level 6-DoF pose estimation algorithm that simultaneously detects and tracks instances of objects within a known category. Our method takes as input the previous and current frame from a monocular RGB video, as well as predictions from the previous frame, to predict the bounding cuboid and 6-DoF pose (up to scale). Internally, a deep network predicts distributions over object keypoints (vertices of the bounding cuboid) in image coordinates, after which a novel probabilistic filtering process integrates across estimates before computing the final pose using PnP. Our framework allows the system to take previous uncertainties into consideration when predicting the current frame, resulting in predictions that are more accurate and stable than single frame methods. Extensive experiments show that our method outperforms existing approaches on the challenging Objectron benchmark of annotated object videos. We also demonstrate the usability of our work in an augmented reality setting.

CVMar 14, 2023

WDiscOOD: Out-of-Distribution Detection via Whitened Linear Discriminant Analysis

Yiye Chen, Yunzhi Lin, Ruinian Xu et al. · gatech

Deep neural networks are susceptible to generating overconfident yet erroneous predictions when presented with data beyond known concepts. This challenge underscores the importance of detecting out-of-distribution (OOD) samples in the open world. In this work, we propose a novel feature-space OOD detection score based on class-specific and class-agnostic information. Specifically, the approach utilizes Whitened Linear Discriminant Analysis to project features into two subspaces - the discriminative and residual subspaces - for which the in-distribution (ID) classes are maximally separated and closely clustered, respectively. The OOD score is then determined by combining the deviation from the input data to the ID pattern in both subspaces. The efficacy of our method, named WDiscOOD, is verified on the large-scale ImageNet-1k benchmark, with six OOD datasets that cover a variety of distribution shifts. WDiscOOD demonstrates superior performance on deep classifiers with diverse backbone architectures, including CNN and vision transformer. Furthermore, we also show that WDiscOOD more effectively detects novel concepts in representation spaces trained with contrastive objectives, including supervised contrastive loss and multi-modality contrastive loss.

ROMar 9, 2023

KGNv2: Separating Scale and Pose Prediction for Keypoint-based 6-DoF Grasp Synthesis on RGB-D input

Yiye Chen, Ruinian Xu, Yunzhi Lin et al. · gatech

We propose a new 6-DoF grasp pose synthesis approach from 2D/2.5D input based on keypoints. Keypoint-based grasp detector from image input has demonstrated promising results in the previous study, where the additional visual information provided by color images compensates for the noisy depth perception. However, it relies heavily on accurately predicting the location of keypoints in the image space. In this paper, we devise a new grasp generation network that reduces the dependency on precise keypoint estimation. Given an RGB-D input, our network estimates both the grasp pose from keypoint detection as well as scale towards the camera. We further re-design the keypoint output space in order to mitigate the negative impact of keypoint prediction noise to Perspective-n-Point (PnP) algorithm. Experiments show that the proposed method outperforms the baseline by a large margin, validating the efficacy of our approach. Finally, despite trained on simple synthetic objects, our method demonstrate sim-to-real capacity by showing competitive results in real-world robot experiments.

SYOct 11, 2022

Geometry of Radial Basis Neural Networks for Safety Biased Approximation of Unsafe Regions

Ahmad Abuaish, Mohit Srinivasan, Patricio A. Vela

Barrier function-based inequality constraints are a means to enforce safety specifications for control systems. When used in conjunction with a convex optimization program, they provide a computationally efficient method to enforce safety for the general class of control-affine systems. One of the main assumptions when taking this approach is the a priori knowledge of the barrier function itself, i.e., knowledge of the safe set. In the context of navigation through unknown environments where the locally safe set evolves with time, such knowledge does not exist. This manuscript focuses on the synthesis of a zeroing barrier function characterizing the safe set based on safe and unsafe sample measurements, e.g., from perception data in navigation applications. Prior work formulated a supervised machine learning algorithm whose solution guaranteed the construction of a zeroing barrier function with specific level-set properties. However, it did not explore the geometry of the neural network design used for the synthesis process. This manuscript describes the specific geometry of the neural network used for zeroing barrier function synthesis, and shows how the network provides the necessary representation for splitting the state space into safe and unsafe regions.

15.2ROApr 17

Factor Graph-Based Shape Estimation for Continuum Robots via Magnus Expansion

Lorenzo Ticozzi, Patricio A. Vela, Panagiotis Tsiotras

Reconstructing the shape of continuum manipulators from sparse, noisy sensor data is a challenging task, owing to the infinite-dimensional nature of such systems. Existing approaches broadly trade off between parametric methods that yield compact state representations but lack probabilistic structure, and Cosserat rod inference on factor graphs, which provides principled uncertainty quantification at the cost of a state dimension that grows with the spatial discretization. This letter combines the strength of both paradigms by estimating the coefficients of a low-dimensional Geometric Variable Strain (GVS) parameterization within a factor graph framework. A novel kinematic factor, derived from the Magnus expansion of the strain field, encodes the closed-form rod geometry as a prior constraint linking the GVS strain coefficients to the backbone pose variables. The resulting formulation yields a compact state vector directly amenable to model-based control, while retaining the modularity, probabilistic treatment and computational efficiency of factor graph inference. The proposed method is evaluated in simulation on a 0.4 m long tendon-driven continuum robot under three measurement configurations, achieving mean position errors below 2 mm for all three scenarios and demonstrating a sixfold reduction in orientation error compared to a Gaussian process regression baseline when only position measurements are available.

28.3ROApr 21

QuadPiPS: A Perception-informed Footstep Planner for Quadrupeds With Semantic Affordance Prediction

Max Asselmeier, Ye Zhao, Patricio A. Vela

This work proposes QuadPiPS, a perception-informed framework for quadrupedal foothold planning in the perception space. QuadPiPS employs a novel ego-centric local environment representation, known as the legged egocan, that is extended here to capture unique legged affordances through a joint geometric and semantic encoding that supports local motion planning and control for quadrupeds. QuadPiPS takes inspiration from the Augmented Leafs with Experience on Foliations (ALEF) planning framework to partition the foothold planning space into its discrete and continuous subspaces. To facilitate real-world deployment, QuadPiPS broadens the ALEF approach by synthesizing perception-informed, real-time, and kinodynamically-feasible reference trajectories through search and trajectory optimization techniques. To support deliberate and exhaustive searching, QuadPiPS over-segments the egocan floor via superpixels to provide a set of planar regions suitable for candidate footholds. Nonlinear trajectory optimization methods then compute swing trajectories to transition between selected footholds and provide long-horizon whole-body reference motions that are tracked under model predictive control and whole body control. Benchmarking with the ANYmal C quadruped across ten simulation environments and five baselines reveals that QuadPiPS excels in safety-critical settings with limited available footholds. Real-world validation on the Unitree Go2 quadruped equipped with a custom computational suite demonstrates that QuadPiPS enables terrain-aware locomotion on hardware.

ROFeb 25, 2022

SGL: Symbolic Goal Learning in a Hybrid, Modular Framework for Human Instruction Following

Ruinian Xu, Hongyi Chen, Yunzhi Lin et al.

This paper investigates robot manipulation based on human instruction with ambiguous requests. The intent is to compensate for imperfect natural language via visual observations. Early symbolic methods, based on manually defined symbols, built modular framework consist of semantic parsing and task planning for producing sequences of actions from natural language requests. Modern connectionist methods employ deep neural networks to automatically learn visual and linguistic features and map to a sequence of low-level actions, in an endto-end fashion. These two approaches are blended to create a hybrid, modular framework: it formulates instruction following as symbolic goal learning via deep neural networks followed by task planning via symbolic planners. Connectionist and symbolic modules are bridged with Planning Domain Definition Language. The vision-and-language learning network predicts its goal representation, which is sent to a planner for producing a task-completing action sequence. For improving the flexibility of natural language, we further incorporate implicit human intents with explicit human instructions. To learn generic features for vision and language, we propose to separately pretrain vision and language encoders on scene graph parsing and semantic textual similarity tasks. Benchmarking evaluates the impacts of different components of, or options for, the vision-and-language learning model and shows the effectiveness of pretraining strategies. Manipulation experiments conducted in the simulator AI2THOR show the robustness of the framework to novel scenarios.

ROJan 4, 2022

Primitive Shape Recognition for Object Grasping

Yunzhi Lin, Chao Tang, Fu-Jen Chu et al.

Shape informs how an object should be grasped, both in terms of where and how. As such, this paper describes a segmentation-based architecture for decomposing objects sensed with a depth camera into multiple primitive shapes, along with a post-processing pipeline for robotic grasping. Segmentation employs a deep network, called PS-CNN, trained on synthetic data with 6 classes of primitive shapes and generated using a simulation engine. Each primitive shape is designed with parametrized grasp families, permitting the pipeline to identify multiple grasp candidates per shape region. The grasps are rank ordered, with the first feasible one chosen for execution. For task-free grasping of individual objects, the method achieves a 94.2% success rate placing it amongst the top performing grasp methods when compared to top-down and SE(3)-based approaches. Additional tests involving variable viewpoints and clutter demonstrate robustness to setup. For task-oriented grasping, PS-CNN achieves a 93.0% success rate. Overall, the outcomes support the hypothesis that explicitly encoding shape primitives within a grasping pipeline should boost grasping performance, including task-free and task-relevant grasp prediction.

CVSep 13, 2021

Single-Stage Keypoint-Based Category-Level Object Pose Estimation from an RGB Image

Yunzhi Lin, Jonathan Tremblay, Stephen Tyree et al.

Prior work on 6-DoF object pose estimation has largely focused on instance-level processing, in which a textured CAD model is available for each object being detected. Category-level 6-DoF pose estimation represents an important step toward developing robotic vision systems that operate in unstructured, real-world scenarios. In this work, we propose a single-stage, keypoint-based approach for category-level object pose estimation that operates on unknown object instances within a known category using a single RGB image as input. The proposed network performs 2D object detection, detects 2D keypoints, estimates 6-DoF pose, and regresses relative bounding cuboid dimensions. These quantities are estimated in a sequential fashion, leveraging the recent idea of convGRU for propagating information from easier tasks to those that are more difficult. We favor simplicity in our design choices: generic cuboid vertex coordinates, single-stage network, and monocular RGB input. We conduct extensive experiments on the challenging Objectron benchmark, outperforming state-of-the-art methods on the 3D IoU metric (27.6% higher than the MobilePose single-stage approach and 7.1% higher than the related two-stage approach).

ROJun 16, 2021

GKNet: grasp keypoint network for grasp candidates detection

Ruinian Xu, Fu-Jen Chu, Patricio A. Vela

Contemporary grasp detection approaches employ deep learning to achieve robustness to sensor and object model uncertainty. The two dominant approaches design either grasp-quality scoring or anchor-based grasp recognition networks. This paper presents a different approach to grasp detection by treating it as keypoint detection in image-space. The deep network detects each grasp candidate as a pair of keypoints, convertible to the grasp representationg = {x, y, w, θ} T , rather than a triplet or quartet of corner points. Decreasing the detection difficulty by grouping keypoints into pairs boosts performance. To promote capturing dependencies between keypoints, a non-local module is incorporated into the network design. A final filtering strategy based on discrete and continuous orientation prediction removes false correspondences and further improves grasp detection performance. GKNet, the approach presented here, achieves a good balance between accuracy and speed on the Cornell and the abridged Jacquard datasets (96.9% and 98.39% at 41.67 and 23.26 fps). Follow-up experiments on a manipulator evaluate GKNet using 4 types of grasping experiments reflecting different nuisance sources: static grasping, dynamic grasping, grasping at varied camera angles, and bin picking. GKNet outperforms reference baselines in static and dynamic grasping experiments while showing robustness to varied camera viewpoints and moderate clutter. The results confirm the hypothesis that grasp keypoints are an effective output representation for deep grasp networks that provide robustness to expected nuisance factors.

CVApr 1, 2021

A Joint Network for Grasp Detection Conditioned on Natural Language Commands

Yiye Chen, Ruinian Xu, Yunzhi Lin et al.

We consider the task of grasping a target object based on a natural language command query. Previous work primarily focused on localizing the object given the query, which requires a separate grasp detection module to grasp it. The cascaded application of two pipelines incurs errors in overlapping multi-object cases due to ambiguity in the individual outputs. This work proposes a model named Command Grasping Network(CGNet) to directly output command satisficing grasps from RGB image and textual command inputs. A dataset with ground truth (image, command, grasps) tuple is generated based on the VMRD dataset to train the proposed network. Experimental results on the generated test set show that CGNet outperforms a cascaded object-retrieval and grasp detection baseline by a large margin. Three physical experiments demonstrate the functionality and performance of CGNet.

ROMar 25, 2021

Multi-View Fusion for Multi-Level Robotic Scene Understanding

Yunzhi Lin, Jonathan Tremblay, Stephen Tyree et al.

We present a system for multi-level scene awareness for robotic manipulation. Given a sequence of camera-in-hand RGB images, the system calculates three types of information: 1) a point cloud representation of all the surfaces in the scene, for the purpose of obstacle avoidance; 2) the rough pose of unknown objects from categories corresponding to primitive shapes (e.g., cuboids and cylinders); and 3) full 6-DoF pose of known objects. By developing and fusing recent techniques in these domains, we provide a rich scene representation for robot awareness. We demonstrate the importance of each of these modules, their complementary nature, and the potential benefits of the system in the context of robotic manipulation.

ROMar 21, 2021

Potential Gap: Using Reactive Policies to Guarantee Safe Navigation

Ruoyang Xu, Shiyu Feng, Patricio A. Vela

This paper considers the integration of gap-based local navigation methods with artificial potential field (APF) methods to derive a local planning module for hierarchical navigation systems that has provable collision-free properties. Given that APF theory applies to idealized robot models, the provable properties are lost when applied to more realistic models. We describe a set of algorithm modifications that correct for these errors and enhance robustness to non-ideal models. Central to the construction of the local planner is the use of sensory-derived local free-space models that detect gaps and use them for the synthesis of the APF. Modifications are given for a nonholonomic robot model. Integration of the local planner, called potential gap, into a hierarchical navigation system provides the local goals and trajectories needed for collision-free navigation through unknown environments. Monte Carlo experiments in benchmark worlds confirm the asserted safety and robustness properties by testing under various robot models.

ROMar 2, 2021

NavTuner: Learning a Scene-Sensitive Family of Navigation Policies

Haoxin Ma, Justin S. Smith, Patricio A. Vela

The advent of deep learning has inspired research into end-to-end learning for a variety of problem domains in robotics. For navigation, the resulting methods may not have the generalization properties desired let alone match the performance of traditional methods. Instead of learning a navigation policy, we explore learning an adaptive policy in the parameter space of an existing navigation module. Having adaptive parameters provides the navigation module with a family of policies that can be dynamically reconfigured based on the local scene structure, and addresses the common assertion in machine learning that engineered solutions are inflexible. Of the methods tested, reinforcement learning (RL) is shown to provide a significant performance boost to a modern navigation method through reduced sensitivity of its success rate to environmental clutter. The outcomes indicate that RL as a meta-policy learner, or dynamic parameter tuner, effectively robustifies algorithms sensitive to external, measurable nuisance factors.

ROFeb 26, 2021

Image-Based Trajectory Tracking through Unknown Environments without Absolute Positioning

Shiyu Feng, Zixuan Wu, Yipu Zhao et al.

This paper describes a stereo image-based visual servoing system for trajectory tracking by a non-holonomic robot without externally derived pose information nor a known visual map of the environment. It is called trajectory servoing. The critical component is a feature-based, indirect Simultaneous Localization And Mapping (SLAM) method to provide a pool of available features with estimated depth, so that they may be propagated forward in time to generate image feature trajectories for visual servoing. Short and long distance experiments show the benefits of trajectory servoing for navigating unknown areas without absolute positioning. Empirically, trajectory servoing has better trajectory tracking performance than pose-based feedback when both rely on the same underlying SLAM system.

CVAug 23, 2020

Good Graph to Optimize: Cost-Effective, Budget-Aware Bundle Adjustment in Visual SLAM

Yipu Zhao, Justin S. Smith, Patricio A. Vela

The cost-efficiency of visual(-inertial) SLAM (VSLAM) is a critical characteristic of resource-limited applications. While hardware and algorithm advances have been significantly improved the cost-efficiency of VSLAM front-ends, the cost-efficiency of VSLAM back-ends remains a bottleneck. This paper describes a novel, rigorous method to improve the cost-efficiency of local BA in a BA-based VSLAM back-end. An efficient algorithm, called Good Graph, is developed to select size-reduced graphs optimized in local BA with condition preservation. To better suit BA-based VSLAM back-ends, the Good Graph predicts future estimation needs, dynamically assigns an appropriate size budget, and selects a condition-maximized subgraph for BA estimation. Evaluations are conducted on two scenarios: 1) VSLAM as standalone process, and 2) VSLAM as part of closed-loop navigation system. Results from the first scenario show Good Graph improves accuracy and robustness of VSLAM estimation, when computational limits exist. Results from the second scenario, indicate that Good Graph benefits the trajectory tracking performance of VSLAM-based closed-loop navigation systems, which is a primary application of VSLAM.

ROMar 3, 2020

Closed-Loop Benchmarking of Stereo Visual-Inertial SLAM Systems: Understanding the Impact of Drift and Latency on Tracking Accuracy

Yipu Zhao, Justin S. Smith, Sambhu H. Karumanchi et al.

Visual-inertial SLAM is essential for robot navigation in GPS-denied environments, e.g. indoor, underground. Conventionally, the performance of visual-inertial SLAM is evaluated with open-loop analysis, with a focus on the drift level of SLAM systems. In this paper, we raise the question on the importance of visual estimation latency in closed-loop navigation tasks, such as accurate trajectory tracking. To understand the impact of both drift and latency on visual-inertial SLAM systems, a closed-loop benchmarking simulation is conducted, where a robot is commanded to follow a desired trajectory using the feedback from visual-inertial estimation. By extensively evaluating the trajectory tracking performance of representative state-of-the-art visual-inertial SLAM systems, we reveal the importance of latency reduction in visual estimation module of these systems. The findings suggest directions of future improvements for visual-inertial SLAM.

ROJan 3, 2020

Good Feature Matching: Towards Accurate, Robust VO/VSLAM with Low Latency

Yipu Zhao, Patricio A. Vela

Analysis of state-of-the-art VO/VSLAM system exposes a gap in balancing performance (accuracy & robustness) and efficiency (latency). Feature-based systems exhibit good performance, yet have higher latency due to explicit data association; direct & semidirect systems have lower latency, but are inapplicable in some target scenarios or exhibit lower accuracy than feature-based ones. This paper aims to fill the performance-efficiency gap with an enhancement applied to feature-based VSLAM. We present good feature matching, an active map-to-frame feature matching method. Feature matching effort is tied to submatrix selection, which has combinatorial time complexity and requires choosing a scoring metric. Via simulation, the Max-logDet matrix revealing metric is shown to perform best. For real-time applicability, the combination of deterministic selection and randomized acceleration is studied. The proposed algorithm is integrated into monocular & stereo feature-based VSLAM systems. Extensive evaluations on multiple benchmarks and compute hardware quantify the latency reduction and the accuracy & robustness preservation.

CVSep 12, 2019

Using Synthetic Data and Deep Networks to Recognize Primitive Shapes for Object Grasping

Yunzhi Lin, Chao Tang, Fu-Jen Chu et al.

A segmentation-based architecture is proposed to decompose objects into multiple primitive shapes from monocular depth input for robotic manipulation. The backbone deep network is trained on synthetic data with 6 classes of primitive shapes generated by a simulation engine. Each primitive shape is designed with parametrized grasp families, permitting the pipeline to identify multiple grasp candidates per shape primitive region. The grasps are priority ordered via proposed ranking algorithm, with the first feasible one chosen for execution. On task-free grasping of individual objects, the method achieves a 94% success rate. On task-oriented grasping, it achieves a 76% success rate. Overall, the method supports the hypothesis that shape primitives can support task-free and task-relevant grasp prediction.

ROSep 12, 2019

Recognizing Object Affordances to Support Scene Reasoning for Manipulation Tasks

Fu-Jen Chu, Ruinian Xu, Chao Tang et al.

Affordance information about a scene provides important clues as to what actions may be executed in pursuit of meeting a specified goal state. Thus, integrating affordance-based reasoning into symbolic action plannning pipelines would enhance the flexibility of robot manipulation. Unfortunately, the top performing affordance recognition methods use object category priors to boost the accuracy of affordance detection and segmentation. Object priors limit generalization to unknown object categories. This paper describes an affordance recognition pipeline based on a category-agnostic region proposal network for proposing instance regions of an image across categories. To guide affordance learning in the absence of category priors, the training process includes the auxiliary task of explicitly inferencing existing affordances within a proposal. Secondly, a self-attention mechanism trained to interpret each proposal learns to capture rich contextual dependencies through the region. Visual benchmarking shows that the trained network, called AffContext, reduces the performance gap between object-agnostic and object-informed affordance recognition. AffContext is linked to the Planning Domain Definition Language (PDDL) with an augmented state keeper for action planning across temporally spaced goal-oriented tasks. Manipulation experiments show that AffContext can successfully parse scene content to seed a symbolic planner problem specification, whose execution completes the target task. Additionally, task-oriented grasping for cutting and pounding actions demonstrate the exploitation of multiple affordances for a given object to complete specified tasks.

ROAug 19, 2019

Autonomous, Monocular, Vision-Based Snake Robot Navigation and Traversal of Cluttered Environments using Rectilinear Gait Motion

Alexander H. Chang, Shiyu Feng, Yipu Zhao et al.

Rectilinear forms of snake-like robotic locomotion are anticipated to be an advantage in obstacle-strewn scenarios characterizing urban disaster zones, subterranean collapses, and other natural environments. The elongated, laterally-narrow footprint associated with these motion strategies is well-suited to traversal of confined spaces and narrow pathways. Navigation and path planning in the absence of global sensing, however, remains a pivotal challenge to be addressed prior to practical deployment of these robotic mechanisms. Several challenges related to visual processing and localization need to be resolved to to enable navigation. As a first pass in this direction, we equip a wireless, monocular color camera to the head of a robotic snake. Visiual odometry and mapping from ORB-SLAM permits self-localization in planar, obstacle-strewn environments. Ground plane traversability segmentation in conjunction with perception-space collision detection permits path planning for navigation. A previously presented dynamical reduction of rectilinear snake locomotion to a non-holonomic kinematic vehicle informs both SLAM and planning. The simplified motion model is then applied to track planned trajectories through an obstacle configuration. This navigational framework enables a snake-like robotic platform to autonomously navigate and traverse unknown scenarios with only monocular vision.

ROMay 19, 2019

Characterizing SLAM Benchmarks and Methods for the Robust Perception Age

Wenkai Ye, Yipu Zhao, Patricio A. Vela

The diversity of SLAM benchmarks affords extensive testing of SLAM algorithms to understand their performance, individually or in relative terms. The ad-hoc creation of these benchmarks does not necessarily illuminate the particular weak points of a SLAM algorithm when performance is evaluated. In this paper, we propose to use a decision tree to identify challenging benchmark properties for state-of-the-art SLAM algorithms and important components within the SLAM pipeline regarding their ability to handle these challenges. Establishing what factors of a particular sequence lead to track failure or degradation relative to these characteristics is important if we are to arrive at a strong understanding for the core computational needs of a robust SLAM algorithm. Likewise, we argue that it is important to profile the computational performance of the individual SLAM components for use when benchmarking. In particular, we advocate the use of time-dilation during ROS bag playback, or what we refer to as slo-mo playback. Using slo-mo to benchmark SLAM instantiations can provide clues to how SLAM implementations should be improved at the computational component level. Three prevalent VO/SLAM algorithms and two low-latency algorithms of our own are tested on selected typical sequences, which are generated from benchmark characterization, to further demonstrate the benefits achieved from computationally efficient components.

ROMay 19, 2019

Good Feature Selection for Least Squares Pose Optimization in VO/VSLAM

Yipu Zhao, Patricio A. Vela

This paper aims to select features that contribute most to the pose estimation in VO/VSLAM. Unlike existing feature selection works that are focused on efficiency only, our method significantly improves the accuracy of pose tracking, while introducing little overhead. By studying the impact of feature selection towards least squares pose optimization, we demonstrate the applicability of improving accuracy via good feature selection. To that end, we introduce the Max-logDet metric to guide the feature selection, which is connected to the conditioning of least squares pose optimization problem. We then describe an efficient algorithm for approximately solving the NP-hard Max-logDet problem. Integrating Max-logDet feature selection into a state-of-the-art visual SLAM system leads to accuracy improvements with low overhead, as demonstrated via evaluation on a public benchmark.

ROMay 19, 2019

Low-latency Visual SLAM with Appearance-Enhanced Local Map Building

Yipu Zhao, Wenkai Ye, Patricio A. Vela

A local map module is often implemented in modern VO/VSLAM systems to improve data association and pose estimation. Conventionally, the local map contents are determined by co-visibility. While co-visibility is cheap to establish, it utilizes the relatively-weak temporal prior (i.e. seen before, likely to be seen now), therefore admitting more features into the local map than necessary. This paper describes an enhancement to co-visibility local map building by incorporating a strong appearance prior, which leads to a more compact local map and latency reduction in downstream data association. The appearance prior collected from the current image influences the local map contents: only the map features visually similar to the current measurements are potentially useful for data association. To that end, mapped features are indexed and queried with Multi-index Hashing (MIH). An online hash table selection algorithm is developed to further reduce the query overhead of MIH and the local map size. The proposed appearance-based local map building method is integrated into a state-of-the-art VO/VSLAM system. When evaluated on two public benchmarks, the size of the local map, as well as the latency of real-time pose tracking in VO/VSLAM are significantly reduced. Meanwhile, the VO/VSLAM mean performance is preserved or improves.

ROFeb 1, 2018

Real-world Multi-object, Multi-grasp Detection

Fu-Jen Chu, Ruinian Xu, Patricio A. Vela

A deep learning architecture is proposed to predict graspable locations for robotic manipulation. It considers situations where no, one, or multiple object(s) are seen. By defining the learning problem to be classification with null hypothesis competition instead of regression, the deep neural network with RGB-D image input predicts multiple grasp candidates for a single object or multiple objects, in a single shot. The method outperforms state-of-the-art approaches on the Cornell dataset with 96.0% and 96.1% accuracy on image-wise and object- wise splits, respectively. Evaluation on a multi-object dataset illustrates the generalization capability of the architecture. Grasping experiments achieve 96.0% grasp localization and 88.0% grasping success rates on a test set of household objects. The real-time process takes less than .25 s from image to plan.

ROFeb 1, 2018

The Helping Hand: An Assistive Manipulation Framework Using Augmented Reality and a Tongue-Drive Interfaces

Fu-Jen Chu, Ruinian Xu, Zhenxuan Zhang et al.

A human-in-the-loop system is proposed to enable collaborative manipulation tasks for person with physical disabilities. Studies show that the cognitive burden of subject reduces with increased autonomy of assistive system. Our framework obtains high-level intent from the user to specify manipulation tasks. The system processes sensor input to interpret the user's environment. Augmented reality glasses provide ego-centric visual feedback of the interpretation and summarize robot affordances on a menu. A tongue drive system serves as the input modality for triggering a robotic arm to execute the tasks. Assistance experiments compare the system to Cartesian control and to state-of-the-art approaches. Our system achieves competitive results with faster completion time by simplifying manipulation tasks.

ROJan 16, 2018

Learning to Navigate: Exploiting Deep Networks to Inform Sample-Based Planning During Vision-Based Navigation

Justin S. Smith, Jin-Ha Hwang, Fu-Jen Chu et al.

Recent applications of deep learning to navigation have generated end-to-end navigation solutions whereby visual sensor input is mapped to control signals or to motion primitives. The resulting visual navigation strategies work very well at collision avoidance and have performance that matches traditional reactive navigation algorithms while operating in real-time. It is accepted that these solutions cannot provide the same level of performance as a global planner. However, it is less clear how such end-to-end systems should be integrated into a full navigation pipeline. We evaluate the typical end-to-end solution within a full navigation pipeline in order to expose its weaknesses. Doing so illuminates how to better integrate deep learning methods into the navigation pipeline. In particular, we show that they are an efficient means to provide informed samples for sample-based planners. Controlled simulations with comparison against traditional planners show that the number of samples can be reduced by an order of magnitude while preserving navigation performance. Implementation on a mobile robot matches the simulated performance outcomes.

OCDec 16, 2017

Bendable Cuboid Robot Path Planning with Collision Avoidance using Generalized $L_p$ Norms

Nak-seung P. Hyun, Patricio A. Vela, Erik I. Verriest

Optimal path planning problems for rigid and deformable (bendable) cuboid robots are considered by providing an analytic safety constraint using generalized $L_p$ norms. For regular cuboid robots, level sets of weighted $L_p$ norms generate implicit approximations of their surfaces. For bendable cuboid robots a weighted $L_p$ norm in polar coordinates implicitly approximates the surface boundary through a specified level set. Obstacle volumes, in the environment to navigate within, are presumed to be approximately described as sub-level sets of weighted $L_p$ norms. Using these approximate surface models, the optimal safe path planning problem is reformulated as a two stage optimization problem, where the safety constraint depends on a point on the robot which is closest to the obstacle in the obstacle's distance metric. A set of equality and inequality constraints are derived to replace the closest point problem, which is then defines additional analytic constraints on the original path planning problem. Combining all the analytic constraints with logical AND operations leads to a general optimal safe path planning problem. Numerically solving the problem involve conversion to a nonlinear programing problem. Simulations for rigid and bendable cuboid robot verify the proposed method.

CVJan 15, 2016

Learning Binary Features Online from Motion Dynamics for Incremental Loop-Closure Detection and Place Recognition

Guangcong Zhang, Mason J. Lilly, Patricio A. Vela

This paper proposes a simple yet effective approach to learn visual features online for improving loop-closure detection and place recognition, based on bag-of-words frameworks. The approach learns a codeword in bag-of-words model from a pair of matched features from two consecutive frames, such that the codeword has temporally-derived perspective invariance to camera motion. The learning algorithm is efficient: the binary descriptor is generated from the mean image patch, and the mask is learned based on discriminative projection by minimizing the intra-class distances among the learned feature and the two original features. A codeword for bag-of-words models is generated by packaging the learned descriptor and mask, with a masked Hamming distance defined to measure the distance between two codewords. The geometric properties of the learned codewords are then mathematically justified. In addition, hypothesis constraints are imposed through temporal consistency in matched codewords, which improves precision. The approach, integrated in an incremental bag-of-words system, is validated on multiple benchmark data sets and compared to state-of-the-art methods. Experiments demonstrate improved precision/recall outperforming state of the art with little loss in runtime.

MLJul 26, 2015

Reduced-Set Kernel Principal Components Analysis for Improving the Training and Execution Speed of Kernel Machines

Hassan A. Kingravi, Patricio A. Vela, Alexandar Gray

This paper presents a practical, and theoretically well-founded, approach to improve the speed of kernel manifold learning algorithms relying on spectral decomposition. Utilizing recent insights in kernel smoothing and learning with integral operators, we propose Reduced Set KPCA (RSKPCA), which also suggests an easy-to-implement method to remove or replace samples with minimal effect on the empirical operator. A simple data point selection procedure is given to generate a substitute density for the data, with accuracy that is governed by a user-tunable parameter . The effect of the approximation on the quality of the KPCA solution, in terms of spectral and operator errors, can be shown directly in terms of the density estimate error and as a function of the parameter . We show in experiments that RSKPCA can improve both training and evaluation time of KPCA by up to an order of magnitude, and compares favorably to the widely-used Nystrom and density-weighted Nystrom methods.

LGSep 10, 2012

A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm

M. Emre Celebi, Hassan A. Kingravi, Patricio A. Vela

K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. In this paper, we first present an overview of these methods with an emphasis on their computational efficiency. We then compare eight commonly used linear time complexity initialization methods on a large and diverse collection of data sets using various performance criteria. Finally, we analyze the experimental results using non-parametric statistical tests and provide recommendations for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in fact strong alternatives to these methods.