RO | Apr 14, 2023
EV-Catcher: High-Speed Object Catching Using Low-latency Event-based Neural Networks
Ziyun Wang, Fernando Cladera Ojeda, Anthony Bisulco et al.
Event-based sensors have recently drawn increasing interest in robotic perception due to their lower latency, higher dynamic range, and lower bandwidth requirements compared to standard CMOS-based imagers. These properties make them ideal tools for real-time perception tasks in highly dynamic environments. In this work, we demonstrate an application where event cameras excel: accurately estimating the impact location of fast-moving objects. We introduce a lightweight event representation called Binary Event History Image (BEHI) to encode event data at low latency, as well as a learning-based approach that allows real-time inference of a confidence-enabled control signal to the robot. To validate our approach, we present an experimental catching system in which we catch fast-flying ping-pong balls. We show that the system is capable of achieving a success rate of 81% in catching balls targeted at different locations, with a velocity of up to 13 m/s even on compute-constrained embedded platforms such as the Nvidia Jetson NX.
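As a rough illustration of the kind of representation this abstract describes, the sketch below builds a binary image in which a pixel is set if at least one event fired there inside a time window. The `(x, y, t)` event layout, the function name `binary_event_history_image`, and the window parameters are assumptions made for this example, not the authors' implementation.

```python
from typing import Iterable, List, Tuple

Event = Tuple[int, int, float]  # (x, y, timestamp); this layout is an assumption


def binary_event_history_image(
    events: Iterable[Event],
    width: int,
    height: int,
    t_start: float,
    t_end: float,
) -> List[List[int]]:
    """Return a height x width binary image: 1 where any event fired in [t_start, t_end]."""
    image = [[0] * width for _ in range(height)]
    for x, y, t in events:
        inside_window = t_start <= t <= t_end
        inside_frame = 0 <= x < width and 0 <= y < height
        if inside_window and inside_frame:
            image[y][x] = 1  # polarity and event count are deliberately ignored
    return image


# Hypothetical events: two fall on the same pixel, one falls outside the window.
events = [(0, 0, 0.001), (2, 1, 0.004), (2, 1, 0.005), (3, 2, 0.020)]
behi = binary_event_history_image(events, width=4, height=3, t_start=0.0, t_end=0.010)
```

Because each pixel stores a single bit, the image costs the same to build and transmit regardless of how many events land on a pixel, which is one plausible reason such a representation suits low-latency inference.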
CV | Nov 21, 2023 | Code
Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge
Bowen Jiang, Zhijun Zhuang, Shreyas S. Shivakumar et al.
This work introduces an enhanced approach to generating scene graphs by incorporating both a relationship hierarchy and commonsense knowledge. Specifically, we begin by proposing a hierarchical relation head that exploits an informative hierarchical structure. It jointly predicts the relation super-category between object pairs in an image, along with detailed relations under each super-category. Following this, we implement a robust commonsense validation pipeline that harnesses foundation models to critique the results from the scene graph prediction system, removing nonsensical predicates even with a small language-only model. Extensive experiments on Visual Genome and OpenImage V6 datasets demonstrate that the proposed modules can be seamlessly integrated as plug-and-play enhancements to existing scene graph generation algorithms. The results show significant improvements with an extensive set of reasonable predictions beyond dataset annotations. Code is available at https://github.com/bowen-upenn/scene_graph_commonsense.
CV | Mar 13, 2023
Hierarchical Relationships: A New Perspective to Enhance Scene Graph Generation
Bowen Jiang, Camillo J. Taylor
This paper presents a finding that leveraging the hierarchical structures among labels for relationships and objects can substantially improve the performance of scene graph generation systems. The focus of this work is to create an informative hierarchical structure that can divide object and relationship categories into disjoint super-categories in a systematic way. Specifically, we introduce a Bayesian prediction head that jointly predicts the super-category of the relationship between a pair of object instances and the detailed relationship within that super-category, facilitating more informative predictions. The resulting model can produce a more extensive set of predicates beyond the dataset annotations and tackle the prevalent issue of low annotation quality. While our paper presents preliminary findings, experiments on the Visual Genome dataset show strong performance, particularly in predicate classification and zero-shot settings, demonstrating the promise of our approach.
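One way to read the joint prediction described in this abstract is as a factorized probability, P(relation) = P(super-category) x P(relation | super-category). The sketch below is a minimal pure-Python illustration under that assumption; the function names, the logits, and the example category labels are hypothetical and are not taken from the paper's code.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_relation_probs(
    super_logits: Dict[str, float],
    fine_logits: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Combine P(super-category) with P(relation | super-category) into a joint score."""
    super_names = list(super_logits)
    p_super = dict(zip(super_names, softmax([super_logits[s] for s in super_names])))
    joint: Dict[str, float] = {}
    for s in super_names:
        relations = list(fine_logits[s])
        p_fine = softmax([fine_logits[s][r] for r in relations])
        for r, p in zip(relations, p_fine):
            joint[r] = p_super[s] * p  # disjoint super-categories, so no overlap
    return joint


# Hypothetical logits for a single object pair.
super_logits = {"geometric": 2.0, "possessive": 0.0}
fine_logits = {
    "geometric": {"on": 1.0, "under": 0.0},
    "possessive": {"has": 0.5, "wearing": 0.5},
}
probs = hierarchical_relation_probs(super_logits, fine_logits)
best = max(probs, key=probs.get)
```

Because the super-categories are disjoint, the joint scores sum to one across all fine-grained relations, so the best relation within each super-category can be ranked against the others directly.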
CL | Apr 19, 2025 | Code
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale
Bowen Jiang, Zhuoqun Hao, Young-Min Cho et al.
Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks -- from offering writing support to delivering tailored recommendations or consultations. Over time, the interaction history between a user and an LLM can provide extensive information about an individual's traits and preferences. However, open questions remain on how well LLMs today can leverage such history to (1) internalize the user's inherent traits and preferences, (2) track how the user's profile and preferences evolve over time, and (3) generate personalized responses accordingly in new scenarios. In this work, we introduce the PERSONAMEM benchmark. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories, each containing up to 60 sessions of multi-turn conversations across 15 real-world tasks that require personalization. Given an in-situ user query, i.e., a query issued by the user from the first-person perspective, we evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile. We observe that current LLMs still struggle to recognize the dynamic evolution in users' profiles over time through direct prompting approaches. As a consequence, LLMs often fail to deliver responses that align with users' current situations and preferences, with frontier models such as GPT-4.1, o4-mini, GPT-4.5, o1, or Gemini-2.0 achieving only around 50% overall accuracy, suggesting room for improvement. We hope that PERSONAMEM, along with the user profile and conversation simulation pipeline, can facilitate future research in the development of truly user-aware chatbots. Code and data are available at github.com/bowen-upenn/PersonaMem.
RO | Sep 26, 2024
EvMAPPER: High Altitude Orthomapping with Event Cameras
Fernando Cladera, Kenneth Chaney, M. Ani Hsieh et al.
Traditionally, unmanned aerial vehicles (UAVs) rely on CMOS-based cameras to collect images of the world below. One of the most successful applications of UAVs is to generate orthomosaics or orthomaps, in which a series of images is integrated to develop a larger map. However, the use of CMOS-based cameras with global or rolling shutters means that orthomaps are vulnerable to challenging light conditions, motion blur, and high-speed motion of independently moving objects under the camera. Event cameras are less sensitive to these issues, as their pixels are able to trigger asynchronously on brightness changes. This work introduces the first orthomosaic approach using event cameras. In contrast to existing methods relying only on CMOS cameras, our approach enables map generation even in challenging light conditions, including direct sunlight and after sunset.
CL | Feb 3
One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence
Bowen Jiang, Taiwei Shi, Ryo Kamoi et al.
This paper introduces OMAR: One Model, All Roles, a reinforcement learning framework that enables AI to develop social intelligence through multi-turn, multi-agent conversational self-play. Unlike traditional paradigms that rely on static, single-turn optimizations, OMAR allows a single model to role-play all participants in a conversation simultaneously, learning to achieve long-term goals and complex social norms directly from dynamic social interaction. To ensure training stability across long dialogues, we implement a hierarchical advantage estimation that calculates turn-level and token-level advantages. Evaluations in the SOTOPIA social environment and Werewolf strategy games show that our trained models develop fine-grained, emergent social intelligence, such as empathy, persuasion, and compromise seeking, demonstrating the effectiveness of learning collaboration even under competitive scenarios. While we identify practical challenges like reward hacking, our results show that rich social intelligence can emerge without human supervision. We hope this work incentivizes further research on AI social intelligence in group conversations.
CV | Feb 16, 2025 | Code
ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations
Bowen Jiang, Yuan Yuan, Xinyi Bai et al.
This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using just raw images without font label annotations. Visual text rendering remains a significant challenge. While recent methods condition diffusion on glyphs, it is impossible to retrieve exact font annotations from large-scale, real-world datasets, which prevents user-specified font control. To address this, we propose a data-driven solution that integrates the conditional diffusion model with a text segmentation model, utilizing segmentation masks to capture and represent fonts in pixel space in a self-supervised manner, thereby eliminating the need for any ground-truth labels and enabling users to customize text rendering with any multilingual font of their choice. The experiment provides a proof of concept of our algorithm in zero-shot text and font editing across diverse fonts and languages, providing valuable insights for the community and industry toward achieving generalized visual text rendering. Code is available at github.com/bowen-upenn/ControlText.
RO | May 13
LMPath: Language-Mediated Priors and Path Generation for Aerial Exploration
Jonathan A. Diller, Fernando Cladera, Camillo J. Taylor et al.
Traditional autonomous UAV search missions rely on geometric coverage patterns that ignore the semantic context of the target, leading to significant time waste in large-scale environments. In this paper we present LMPath, a pipeline for generating language-mediated exploration priors for Unmanned Aerial Vehicle (UAV) search missions that leverages semantics. Given a basic geofence and an object of interest prompt, LMPath uses generative language models to determine what regions of the environment should contain that object and a foundation vision model run over satellite imagery to segment sub-regions that form the exploration prior. This prior can then be used to generate UAV paths with various objectives, such as minimizing the expected time to locate the object of interest, maximizing the probability that the object is found given a limited travel distance, or narrowing down the search space to sub-regions that are most likely to contain the object. To demonstrate its capabilities, we used LMPath to generate various UAV paths and ran them using a real UAV over large-scale environments. We also ran simulations to demonstrate how paths generated using LMPath outperform traditional path planning approaches for search missions.
CL | Jun 16, 2024 | Code
A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners
Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao et al.
This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or primarily depend on token bias. We go beyond evaluating LLMs on accuracy; rather, we aim to investigate their token bias in solving logical reasoning tasks. Specifically, we develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems. Our framework outlines a list of hypotheses where token biases are readily identifiable, with all null hypotheses assuming genuine reasoning capabilities of LLMs. The findings in this study suggest, with statistical guarantee, that most LLMs still struggle with logical reasoning. While they may perform well on classic problems, their success largely depends on recognizing superficial patterns with strong token bias, thereby raising concerns about their actual reasoning and generalization abilities. Code and data are open-sourced at https://github.com/bowen-upenn/llm_token_bias.
AI | Jun 1, 2024 | Code
Towards Rationality in Language and Multimodal Agents: A Survey
Bowen Jiang, Yangxinyu Xie, Xiaomeng Wang et al.
This work discusses how to build more rational language and multimodal agents and what criteria define rationality in intelligent systems. Rationality is the quality of being guided by reason, characterized by decision-making that aligns with evidence and logical principles. It plays a crucial role in reliable problem-solving by ensuring well-grounded and consistent solutions. Despite their progress, large language models (LLMs) often fall short of rationality due to their bounded knowledge space and inconsistent outputs. In response, recent efforts have shifted toward developing multimodal and multi-agent systems, as well as integrating modules like external tools, program code, symbolic reasoners, utility functions, and conformal risk controls rather than relying solely on a single LLM for decision-making. This paper surveys state-of-the-art advancements in language and multimodal agents, assesses their role in enhancing rationality, and outlines open challenges and future research directions. We maintain an open repository at https://github.com/bowen-upenn/Agent_Rationality.
RO | Oct 4, 2021 | Code
LLOL: Low-Latency Odometry for Spinning Lidars
Chao Qu, Shreyas S. Shivakumar, Wenxin Liu et al.
In this paper, we present a low-latency odometry system designed for spinning lidars. Many existing lidar odometry methods wait for an entire sweep from the lidar before processing the data. This introduces a large delay between the first laser firing and its pose estimate. To reduce this latency, we treat the spinning lidar as a streaming sensor and process packets as they arrive. This effectively distributes expensive operations across time, resulting in a very fast and lightweight system with much higher throughput and lower latency. Our open-source implementation is available at https://github.com/versatran01/llol.
RO | Nov 30, 2017 | Code
Robust Stereo Visual Inertial Odometry for Fast Autonomous Flight
Ke Sun, Kartik Mohta, Bernd Pfrommer et al.
In recent years, vision-aided inertial odometry for state estimation has matured significantly. However, we still encounter challenges in terms of improving the computational efficiency and robustness of the underlying algorithms for applications in autonomous flight with micro aerial vehicles, in which it is difficult to use high-quality sensors and powerful processors because of constraints on size and weight. In this paper, we present a filter-based stereo visual inertial odometry that uses the Multi-State Constraint Kalman Filter (MSCKF) [1]. Previous work on stereo visual inertial odometry has resulted in solutions that are computationally expensive. We demonstrate that our Stereo Multi-State Constraint Kalman Filter (S-MSCKF) is comparable to state-of-the-art monocular solutions in terms of computational cost, while providing significantly greater robustness. We evaluate our S-MSCKF algorithm and compare it with state-of-the-art methods including OKVIS, ROVIO, and VINS-MONO on both the EuRoC dataset and our own experimental datasets demonstrating fast autonomous flight with a maximum speed of 17.5 m/s in indoor and outdoor environments. Our implementation of the S-MSCKF is available at https://github.com/KumarRobotics/msckf_vio.
AI | Feb 12, 2024
WildfireGPT: Tailored Large Language Model for Wildfire Analysis
Yangxinyu Xie, Bowen Jiang, Tanwi Mallick et al.
Recent advances in large language models (LLMs) represent a transformational capability at the frontier of artificial intelligence. However, LLMs are generalized models, trained on extensive text corpora, and often struggle to provide context-specific information, particularly in areas requiring specialized knowledge, such as wildfire details within the broader context of climate change. For decision-makers focused on wildfire resilience and adaptation, it is crucial to obtain responses that are not only precise but also domain-specific. To that end, we developed WildfireGPT, a prototype LLM agent designed to transform user queries into actionable insights on wildfire risks. We enrich WildfireGPT by providing additional context, such as climate projections and scientific literature, to ensure its information is current, relevant, and scientifically accurate. This enables WildfireGPT to be an effective tool for delivering detailed, user-specific insights on wildfire risks that support decision-making by a diverse set of end users, including but not limited to researchers and engineers.
CV | Mar 21, 2024
Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering
Bowen Jiang, Zhijun Zhuang, Shreyas S. Shivakumar et al.
This work explores the zero-shot capabilities of foundation models in Visual Question Answering (VQA) tasks. We propose an adaptive multi-agent system, named Multi-Agent VQA, to overcome the limitations of foundation models in object detection and counting by using specialized agents as tools. Unlike existing approaches, our study focuses on the system's performance without fine-tuning it on specific VQA datasets, making it more practical and robust in the open world. We present preliminary experimental results under zero-shot scenarios and highlight some failure cases, offering new directions for future research.
CL | Apr 24, 2025
A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation
Yangxinyu Xie, Bowen Jiang, Tanwi Mallick et al.
Large language models (LLMs) are a transformational capability at the frontier of artificial intelligence and machine learning that can support decision-makers in addressing pressing societal challenges such as extreme natural hazard events. As generalized models, LLMs often struggle to provide context-specific information, particularly in areas requiring specialized knowledge. In this work we propose a retrieval-augmented generation (RAG)-based multi-agent LLM system to support analysis and decision-making in the context of natural hazards and extreme weather events. As a proof of concept, we present WildfireGPT, a specialized system focused on wildfire hazards. The architecture employs a user-centered, multi-agent design to deliver tailored risk insights across diverse stakeholder groups. By integrating natural hazard and extreme weather projection data, observational datasets, and scientific literature through an RAG framework, the system ensures both the accuracy and contextual relevance of the information it provides. Evaluation across ten expert-led case studies demonstrates that WildfireGPT significantly outperforms existing LLM-based solutions for decision support.
RO | May 14, 2025
Air-Ground Collaboration for Language-Specified Missions in Unknown Environments
Fernando Cladera, Zachary Ravichandran, Jason Hughes et al.
As autonomous robotic systems become increasingly mature, users will want to specify missions at the level of intent rather than in low-level detail. Language is an expressive and intuitive medium for such mission specification. However, realizing language-guided robotic teams requires overcoming significant technical hurdles. Interpreting and realizing language-specified missions requires advanced semantic reasoning. Successful heterogeneous robots must effectively coordinate actions and share information across varying viewpoints. Additionally, communication between robots is typically intermittent, necessitating robust strategies that leverage communication opportunities to maintain coordination and achieve mission objectives. In this work, we present a first-of-its-kind system where an unmanned aerial vehicle (UAV) and an unmanned ground vehicle (UGV) are able to collaboratively accomplish missions specified in natural language while reacting to changes in specification on the fly. We leverage a Large Language Model (LLM)-enabled planner to reason over semantic-metric maps that are built online and opportunistically shared between an aerial and a ground robot. We consider task-driven navigation in urban and rural areas. Our system must infer mission-relevant semantics and actively acquire information via semantic mapping. In both ground and air-ground teaming experiments, we demonstrate our system on seven different natural-language specifications at up to kilometer-scale navigation.
RO | May 14, 2025
Deploying Foundation Model-Enabled Air and Ground Robots in the Field: Challenges and Opportunities
Zachary Ravichandran, Fernando Cladera, Jason Hughes et al.
The integration of foundation models (FMs) into robotics has enabled robots to understand natural language and reason about the semantics in their environments. However, existing FM-enabled robots primarily operate in closed-world settings, where the robot is given a full prior map or has a full view of its workspace. This paper addresses the deployment of FM-enabled robots in the field, where missions often require a robot to operate in large-scale and unstructured environments. To effectively accomplish these missions, robots must actively explore their environments, navigate obstacle-cluttered terrain, handle unexpected sensor inputs, and operate with compute constraints. We discuss recent deployments of SPINE, our LLM-enabled autonomy framework, in field robotic settings. To the best of our knowledge, we present the first demonstration of large-scale LLM-enabled robot planning in unstructured environments with several kilometers of missions. SPINE is agnostic to a particular LLM, which allows us to distill small language models capable of running onboard size, weight and power (SWaP) limited platforms. Via preliminary model distillation work, we then present the first language-driven UAV planner using on-device language models. We conclude our paper by proposing several promising directions for future research.
RO | Sep 14, 2021
Large-scale Autonomous Flight with Real-time Semantic SLAM under Dense Forest Canopy
Xu Liu, Guilherme V. Nardari, Fernando Cladera Ojeda et al.
Semantic maps represent the environment using a set of semantically meaningful objects. This representation is storage-efficient, less ambiguous, and more informative, thus facilitating large-scale autonomy and the acquisition of actionable information in highly unstructured, GPS-denied environments. In this letter, we propose an integrated system that can perform large-scale autonomous flights and real-time semantic mapping in challenging under-canopy environments. We detect and model tree trunks and ground planes from LiDAR data, which are associated across scans and used to constrain robot poses as well as tree trunk models. The autonomous navigation module utilizes a multi-level planning and mapping framework and computes dynamically feasible trajectories that lead the UAV to build a semantic map of the user-defined region of interest in a computationally and storage efficient manner. A drift-compensation mechanism is designed to minimize the odometry drift using semantic SLAM outputs in real time, while maintaining planner optimality and controller stability. This leads the UAV to execute its mission accurately and safely at scale.
CV | Mar 29, 2021
Bayesian Deep Basis Fitting for Depth Completion with Uncertainty
Chao Qu, Wenxin Liu, Camillo J. Taylor
In this work we investigate the problem of uncertainty estimation for image-guided depth completion. We extend Deep Basis Fitting (DBF) for depth completion within a Bayesian evidence framework to provide calibrated per-pixel variance. The DBF approach frames the depth completion problem in terms of a network that produces a set of low-dimensional depth bases and a differentiable least squares fitting module that computes the basis weights using the sparse depths. By adopting a Bayesian treatment, our Bayesian Deep Basis Fitting (BDBF) approach is able to 1) predict high-quality uncertainty estimates and 2) enable depth completion with few or no sparse measurements. We conduct controlled experiments to compare BDBF against commonly used techniques for uncertainty estimation under various scenarios. Results show that our method produces better uncertainty estimates with accurate depth prediction.
CV | Sep 22, 2020
PennSyn2Real: Training Object Recognition Models without Human Labeling
Ty Nguyen, Ian D. Miller, Avi Cohen et al.
Scalable training data generation is a critical problem in deep learning. We propose PennSyn2Real - a photo-realistic synthetic dataset consisting of more than 100,000 4K images of more than 20 types of micro aerial vehicles (MAVs). The dataset can be used to generate arbitrary numbers of training images for high-level computer vision tasks such as MAV detection and classification. Our data generation framework bootstraps chroma-keying, a mature cinematography technique with a motion tracking system, providing artifact-free and curated annotated images where object orientations and lighting are controlled. This framework is easy to set up and can be applied to a broad range of objects, reducing the gap between synthetic and real-world data. We show that synthetic data generated using this framework can be directly used to train CNN models for common object recognition tasks such as detection and segmentation. We demonstrate competitive performance in comparison with training using only real images. Furthermore, bootstrapping the generated synthetic data in few-shot learning can significantly improve the overall performance, reducing the number of required training data samples to achieve the desired accuracy.
CV | Dec 21, 2019
Depth Completion via Deep Basis Fitting
Chao Qu, Ty Nguyen, Camillo J. Taylor
In this paper we consider the task of image-guided depth completion where our system must infer the depth at every pixel of an input image based on the image content and a sparse set of depth measurements. We propose a novel approach that builds upon the strengths of modern deep learning techniques and classical optimization algorithms and significantly improves performance. The proposed method replaces the final $1\times 1$ convolutional layer employed in most depth completion networks with a least squares fitting module which computes weights by fitting the implicit depth bases to the given sparse depth measurements. In addition, we show how our proposed method can be naturally extended to a multi-scale formulation for improved self-supervised training. We demonstrate through extensive experiments on various datasets that our approach achieves consistent improvements over state-of-the-art baseline methods with small computational overhead.
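The least squares fitting step in this abstract can be sketched as follows: given K per-pixel depth bases and a handful of sparse depth measurements, solve for the K weights that best reproduce the measurements, then apply those weights at every pixel. This is a pure-Python illustration under stated assumptions. The flattened-list image layout, the function names, and the small ridge term added for numerical stability are choices made for this example and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_basis_weights(
    bases: Sequence[Sequence[float]],
    sparse: Sequence[Tuple[int, float]],
    ridge: float = 1e-9,  # assumed regularizer, keeps the normal equations solvable
) -> List[float]:
    """Fit weights w minimizing the squared error at the sparse (pixel_index, depth) samples."""
    k = len(bases)
    ata = [
        [sum(bases[i][p] * bases[j][p] for p, _ in sparse) for j in range(k)]
        for i in range(k)
    ]
    for i in range(k):
        ata[i][i] += ridge
    atb = [sum(bases[i][p] * d for p, d in sparse) for i in range(k)]
    return solve_linear_system(ata, atb)


def complete_depth(bases: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Dense depth is the weighted sum of the bases at every pixel."""
    return [sum(w * basis[p] for w, basis in zip(weights, bases)) for p in range(len(bases[0]))]


# Hypothetical 4-pixel image with a constant basis and a ramp basis.
bases = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0]]
sparse = [(0, 2.0), (3, 3.5)]  # depth known at pixels 0 and 3 only
weights = fit_basis_weights(bases, sparse)
dense = complete_depth(bases, weights)
```

In this toy case the two measurements are explained exactly by weights close to 2.0 and 0.5, and the remaining pixels are filled in from the bases. In the paper's setting the bases would come from a network rather than being hand-picked.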
CV | Sep 20, 2019
PST900: RGB-Thermal Calibration, Dataset and Segmentation Network
Shreyas S. Shivakumar, Neil Rodrigues, Alex Zhou et al.
In this work we propose long wave infrared (LWIR) imagery as a viable supporting modality for semantic segmentation using learning-based techniques. We first address the problem of RGB-thermal camera calibration by proposing a passive calibration target and procedure that is both portable and easy to use. Second, we present PST900, a dataset of 894 synchronized and calibrated RGB and Thermal image pairs with per pixel human annotations across four distinct classes from the DARPA Subterranean Challenge. Lastly, we propose a CNN architecture for fast semantic segmentation that combines both RGB and Thermal imagery in a way that leverages RGB imagery independently. We compare our method against state-of-the-art methods and show that it outperforms them on our dataset.
RO | Sep 18, 2019
Vision-based Multi-MAV Localization with Anonymous Relative Measurements Using Coupled Probabilistic Data Association Filter
Ty Nguyen, Kartik Mohta, Camillo J. Taylor et al.
We address the localization of robots in a multi-MAV system where external infrastructure like GPS or motion capture systems may not be available. Our approach lends itself to implementation on platforms with several constraints on size, weight, and power (SWaP). In particular, our framework fuses the onboard VIO with anonymous, vision-based robot-to-robot detection to estimate all robot poses in one common frame, addressing three main challenges: 1) the initial configuration of the robot team is unknown, 2) the data association between each vision-based detection and robot targets is unknown, and 3) the vision-based detection yields false negatives and false positives, and provides inaccurate, noisy bearing and distance measurements of other robots. Our approach extends the Coupled Probabilistic Data Association Filter (CPDAF) [1] to cope with nonlinear measurements. We demonstrate the superior performance of our approach over a simple VIO-based method in a simulation with the measurement models statistically modeled using real experimental data. We also show how onboard sensing, estimation, and control can be used for formation flight.
CV | Apr 3, 2019
MAVNet: an Effective Semantic Segmentation Micro-Network for MAV-based Tasks
Ty Nguyen, Shreyas S. Shivakumar, Ian D. Miller et al.
Real-time semantic image segmentation on platforms subject to size, weight and power (SWaP) constraints is a key area of interest for air surveillance and inspection. In this work, we propose MAVNet: a small, light-weight, deep neural network for real-time semantic segmentation on Micro Aerial Vehicles (MAVs). MAVNet, inspired by ERFNet, features 400 times fewer parameters and achieves comparable performance with some reference models in empirical experiments. Our model offers a trade-off between speed and accuracy, running at up to 48 FPS on an NVIDIA 1080Ti and 9 FPS on the NVIDIA Jetson Xavier when processing high resolution imagery. Additionally, we provide two novel datasets that represent challenges in semantic segmentation for real-time MAV tracking and infrastructure inspection tasks and verify MAVNet on these datasets. Our algorithm and datasets are made publicly available.
CV | Mar 15, 2019
DFineNet: Ego-Motion Estimation and Depth Refinement from Sparse, Noisy Depth Input with RGB Guidance
Yilun Zhang, Ty Nguyen, Ian D. Miller et al.
Depth estimation is an important capability for autonomous vehicles to understand and reconstruct 3D environments as well as avoid obstacles during execution. Accurate depth sensors such as LiDARs are often heavy, expensive and can only provide sparse depth, while lighter depth sensors such as stereo cameras are noisier in comparison. We propose an end-to-end learning algorithm that is capable of using sparse, noisy input depth for refinement and depth completion. Our model also produces the camera pose as a byproduct, making it a great solution for autonomous systems. We evaluate our approach on both indoor and outdoor datasets. Empirical results show that our method performs well on the KITTI dataset when compared to other competing methods, while having superior performance in dealing with sparse, noisy input depth on the TUM dataset.
CV | Feb 2, 2019
DFuseNet: Deep Fusion of RGB and Sparse Depth Information for Image Guided Dense Depth Completion
Shreyas S. Shivakumar, Ty Nguyen, Ian D. Miller et al.
In this paper we propose a convolutional neural network that is designed to upsample a series of sparse range measurements based on the contextual cues gleaned from a high resolution intensity image. Our approach draws inspiration from related work on super-resolution and in-painting. We propose a novel architecture that seeks to pull contextual cues separately from the intensity image and the depth features and then fuse them later in the network. We argue that this approach effectively exploits the relationship between the two modalities and produces accurate results while respecting salient image structures. We present experimental results to demonstrate that our approach is comparable with state of the art methods and generalizes well across multiple datasets.
CV | Nov 19, 2018
Predictive and Semantic Layout Estimation for Robotic Applications in Manhattan Worlds
Armon Shariati, Bernd Pfrommer, Camillo J. Taylor
This paper describes an approach to automatically extracting floor plans from the kinds of incomplete measurements that could be acquired by an autonomous mobile robot. The approach proceeds by reasoning about extended structural layout surfaces which are automatically extracted from the available data. The scheme can be run in an online manner to build watertight representations of the environment. The system effectively speculates about room boundaries and free space regions, which provides useful guidance to subsequent motion planning systems. Experimental results are presented on multiple data sets.
RONov 4, 2018
Monocular Camera Based Fruit Counting and Mapping with Semantic Data AssociationXu Liu, Steven W. Chen, Chenhao Liu et al.
We present a cheap, lightweight, and fast fruit counting pipeline that uses a single monocular camera. On a mango dataset, our pipeline achieves counting performance comparable to a state-of-the-art fruit counting system that utilizes an expensive sensor suite including LiDAR and GPS/INS. Our monocular camera pipeline begins with a fruit detection component that uses a deep neural network. It then uses semantic structure from motion (SfM) to convert these detections into fruit counts by estimating landmark locations of the fruit in 3D, and using these landmarks to identify double counting scenarios. There are many benefits of developing a low-cost and lightweight fruit counting system, including applicability to agriculture in developing countries, where monetary constraints or unstructured environments necessitate cheaper hardware solutions.
CVSep 20, 2018
Real Time Dense Depth Estimation by Fusing Stereo with Sparse Depth MeasurementsShreyas S. Shivakumar, Kartik Mohta, Bernd Pfrommer et al.
We present an approach to depth estimation that fuses information from a stereo pair with sparse range measurements derived from a LIDAR sensor or a range camera. The goal of this work is to exploit the complementary strengths of the two sensor modalities, the accurate but sparse range measurements and the ambiguous but dense stereo information. These two sources are effectively and efficiently fused by combining ideas from anisotropic diffusion and semi-global matching. We evaluate our approach on the KITTI 2015 and Middlebury 2014 datasets, using randomly sampled ground truth range measurements as our sparse depth input. We achieve significant performance improvements with a small fraction of range measurements on both datasets. We also provide qualitative results from our platform using the PMDTec Monstar sensor. Our entire pipeline runs on an NVIDIA TX-2 platform at 5Hz on 1280x1024 stereo images with 128 disparity levels.
ROSep 20, 2018
The Open Vision Computer: An Integrated Sensing and Compute System for Mobile RobotsMorgan Quigley, Kartik Mohta, Shreyas S. Shivakumar et al.
In this paper we describe the Open Vision Computer (OVC) which was designed to support high speed, vision guided autonomous drone flight. In particular our aim was to develop a system that would be suitable for relatively small-scale flying platforms where size, weight, power consumption and computational performance were all important considerations. This manuscript describes the primary features of our OVC system and explains how they are used to support fully autonomous indoor and outdoor exploration and navigation operations on our Falcon 250 quadrotor platform.
CVSep 18, 2018
U-Net for MAV-based Penstock Inspection: an Investigation of Focal Loss in Multi-class Segmentation for Corrosion IdentificationTy Nguyen, Tolga Ozaslan, Ian D. Miller et al.
Periodic inspection and maintenance of critical infrastructure such as dams, penstocks, and locks are of significant importance to prevent catastrophic failures. Conventional manual inspection methods require inspectors to climb along a penstock to spot corrosion, rust, and crack formation, which is unsafe, labor-intensive, and requires intensive training. This work presents an alternative approach using a Micro Aerial Vehicle (MAV) that autonomously flies to collect imagery which is then fed into a pretrained deep-learning model to identify corrosion. Our simplified U-Net, trained with fewer than 40 image samples, can do inference at 12 fps on a single GPU. We analyze different loss functions to solve the class imbalance problem, followed by a discussion on choosing proper metrics and weights for object classes. Results obtained with the dataset collected from Center Hill Dam, TN show that the focal loss function, combined with a proper set of class weights, yields better segmentation results than the base loss, softmax cross entropy. Our method can be used in combination with a planning algorithm to offer a complete, safe, and cost-efficient solution to autonomous infrastructure inspection.
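As a rough illustration of the loss comparison described in this abstract, the sketch below computes a class-weighted focal loss for a single pixel and compares it with plain softmax cross entropy. It uses the commonly cited form FL = -w_t (1 - p_t)^gamma log(p_t); the function names, the value gamma = 2.0, and the example class weights are assumptions for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of per-class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Softmax cross entropy for one pixel with integer class label `target`."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Class-weighted focal loss for one pixel.

    gamma and class_weights are illustrative assumptions, not the paper's
    settings. With gamma = 0 and unit weights this reduces to cross entropy.
    """
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    weights = [1.0, 1.0, 1.0]        # hypothetical per-class weights
    easy = [5.0, 0.0, 0.0]           # confidently correct prediction
    hard = [0.2, 0.0, 0.1]           # uncertain prediction
    for name, logits in (("easy", easy), ("hard", hard)):
        ce = cross_entropy(logits, 0)
        fl = focal_loss(logits, 0, weights)
        print(f"{name}: CE={ce:.4f} focal={fl:.4f} ratio={fl / ce:.4f}")
```

The ratio of focal loss to cross entropy is (1 - p_t)^gamma, so well-classified pixels (often the dominant background class) contribute much less to the total, which is the usual motivation for using focal loss under class imbalance.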
ROSep 11, 2018
Simultaneous Localization and Layout Model Selection in Manhattan WorldsArmon Shariati, Bernd Pfrommer, Camillo J. Taylor
In this paper, we will demonstrate how Manhattan structure can be exploited to transform the Simultaneous Localization and Mapping (SLAM) problem, which is typically solved by a nonlinear optimization over feature positions, into a model selection problem solved by a convex optimization over higher order layout structures, namely walls, floors, and ceilings. Furthermore, we show how our novel formulation leads to an optimization procedure that automatically performs data association and loop closure and which ultimately produces the simplest model of the environment that is consistent with the available measurements. We verify our method on real world data sets collected with various sensing modalities.
CVApr 1, 2018
Robust Fruit Counting: Combining Deep Learning, Tracking, and Structure from MotionXu Liu, Steven W. Chen, Shreyas Aditya et al.
We present a novel fruit counting pipeline that combines deep segmentation, frame-to-frame tracking, and 3D localization to accurately count visible fruits across a sequence of images. Our pipeline works on image streams from a monocular camera, both in natural light and with controlled illumination at night. We first train a Fully Convolutional Network (FCN) and segment video frame images into fruit and non-fruit pixels. We then track fruits across frames using the Hungarian Algorithm, where the objective cost is determined from a Kalman Filter corrected Kanade-Lucas-Tomasi (KLT) Tracker. In order to correct the estimated count from the tracking process, we combine tracking results with a Structure from Motion (SfM) algorithm to calculate relative 3D locations and size estimates to reject outliers and double-counted fruit tracks. We evaluate our algorithm by comparing with ground-truth human-annotated visual counts. Our results demonstrate that our pipeline is able to accurately and reliably count fruits across image sequences, and the correction step can significantly improve the counting accuracy and robustness. Although discussed in the context of fruit counting, our work can extend to detection, tracking, and counting of a variety of other stationary features of interest such as leaf-spots, wilt, and blossom.
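The frame-to-frame association step in this abstract can be pictured as a minimum-cost assignment between predicted track positions and new detections. The sketch below is only an illustration under stated assumptions: it uses plain Euclidean distance between image points as the cost (the paper derives its cost from a Kalman Filter corrected KLT tracker), and it finds the optimum by brute-force enumeration rather than the Hungarian Algorithm. On small inputs both give the same minimum-cost matching; the Hungarian Algorithm simply reaches it in polynomial time. All names here are hypothetical.

```python
import math
from itertools import permutations


def build_cost_matrix(predicted, detections):
    """Cost[i][j] = Euclidean distance from predicted track i to detection j.

    Distance is a stand-in cost chosen for illustration only.
    """
    return [[math.dist(p, d) for d in detections] for p in predicted]


def min_cost_assignment(cost):
    """Return (columns, total) minimizing the summed cost, one column per row.

    Brute-force search over permutations; practical only for a handful of
    tracks. Requires at least as many columns (detections) as rows (tracks).
    """
    n_rows, n_cols = len(cost), len(cost[0])
    if n_rows > n_cols:
        raise ValueError("need at least as many detections as tracks")
    best_cols, best_total = None, float("inf")
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[r][c] for r, c in enumerate(cols))
        if total < best_total:
            best_cols, best_total = list(cols), total
    return best_cols, best_total


if __name__ == "__main__":
    predicted = [(10.0, 10.0), (50.0, 50.0)]                  # track predictions
    detections = [(52.0, 49.0), (11.0, 9.0), (200.0, 200.0)]  # new detections
    cols, total = min_cost_assignment(build_cost_matrix(predicted, detections))
    for track, det in enumerate(cols):
        print(f"track {track} -> detection {det}")
    print(f"total cost: {total:.3f}")
```

In this example the third detection is left unmatched; a real tracker would typically also gate matches by a maximum allowed cost and start a new track for unmatched detections.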
CVSep 12, 2017
Unsupervised Deep Homography: A Fast and Robust Homography Estimation ModelTy Nguyen, Steven W. Chen, Shreyas S. Shivakumar et al.
Homography estimation between multiple aerial images can provide relative pose estimation for collaborative autonomous exploration and monitoring. Use on a robotic system requires a fast and robust homography estimation algorithm. In this study, we propose an unsupervised learning algorithm that trains a Deep Convolutional Neural Network to estimate planar homographies. We compare the proposed algorithm to traditional feature-based and direct methods, as well as a corresponding supervised learning algorithm. Our empirical results demonstrate that compared to traditional approaches, the unsupervised algorithm achieves faster inference speed, while maintaining comparable or better accuracy and robustness to illumination variation. In addition, on both a synthetic dataset and a representative real-world aerial dataset, our unsupervised method has superior adaptability and performance compared to the supervised deep learning method.
CVSep 26, 2013
Online Algorithms for Factorization-Based Structure from MotionRyan Kennedy, Laura Balzano, Stephen J. Wright et al.
We present a family of online algorithms for real-time factorization-based structure from motion, leveraging a relationship between incremental singular value decomposition and recently proposed methods for online matrix completion. Our methods are orders of magnitude faster than previous state of the art, can handle missing data and a variable number of feature points, and are robust to noise and sparse outliers. We demonstrate our methods on both real and synthetic sequences and show that they perform well in both online and batch settings. We also provide an implementation which is able to produce 3D models in real time using a laptop with a webcam.