CV | Jun 2 | Code
Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion
Oskar Natan, Jun Miura
We present a novel compact deep multi-task learning model to handle various autonomous driving perception tasks in one forward pass. The model performs semantic segmentation on multiple views, depth estimation, light detection and ranging (LiDAR) segmentation, and bird's eye view projection simultaneously without support from other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning issue that arises from the large number of tasks. Through data pre-processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle. Therefore, a better understanding of a dynamically changing environment can be achieved. Based on the ablation study, the model variant trained with our proposed method achieves better performance. Furthermore, a comparative study is conducted to clarify its performance and effectiveness against a combination of some recent models. As a result, our model maintains better performance even with far fewer parameters. Hence, the model can run inference faster with less GPU memory utilization. Moreover, the results tend to be consistent across 3 different CARLA simulation datasets and 1 real-world nuScenes-lidarseg dataset. To support future research, we share code and other files publicly at https://github.com/oskarnatan/compact-perception.
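The abstract does not spell out the adaptive loss weighting algorithm. As a rough illustration of the general idea only, the sketch below swaps in a different, published scheme, Dynamic Weight Average style weighting, in which tasks whose loss has been shrinking more slowly receive a larger weight. The function name `dwa_weights` and the `temperature` parameter are assumptions made for this example and are not taken from the paper.

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic Weight Average style task weights (illustrative, not the paper's method).

    prev_losses:      per-task loss at epoch t-1
    prev_prev_losses: per-task loss at epoch t-2 (all values must be > 0)
    Returns one weight per task; the weights sum to the number of tasks.
    """
    # Ratio close to 1 means the task barely improved; small ratio means fast progress.
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    # Softmax over the ratios, softened by the temperature.
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    k = len(ratios)
    return [k * e / total for e in exps]

# Example: task 0 stalled, task 1 halved its loss.
weights = dwa_weights([1.0, 1.0], [1.0, 2.0])
total_loss = sum(w * l for w, l in zip(weights, [1.0, 1.0]))
```

In this example the stalled task receives the larger weight, so the combined loss pushes training effort toward it.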
RO | May 31 | Code
DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance
Oskar Natan, Andi Dharmawan, Aufaclav Zatu Kusuma Frisky et al.
Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and motion blur during highly dynamic encounters, specifically sudden pedestrian crossings. To address this critical safety vulnerability, we propose DeepIPCv3, a novel multi-modal autonomous navigation framework that synergizes the dense 3D spatial geometry of LiDAR point clouds with the microsecond-level asynchronous event streams of a Dynamic Vision Sensor (DVS). We introduce a Transformer-inspired cross-modal attention mechanism to dynamically correlate these distinct modalities, allowing the network to instantaneously prioritize high-speed dynamic updates without sacrificing structural scene awareness. The fused latent representations are then mapped to safe local waypoints and executable control commands via a hybrid policy network that blends heuristic trajectory tracking with direct neural predictions. Due to the severe physical risks associated with live testing of these sudden crossing scenarios, the framework is rigorously evaluated offline using a custom multi-modal dataset collected across both well-illuminated noon and challenging evening conditions. Extensive comparative and ablation studies demonstrate that DeepIPCv3 achieves state-of-the-art predictive performance. By effectively eliminating exposure failures and motion blur, the proposed LiDAR and DVS fusion yields the lowest trajectory and control command errors, enabling highly reactive, mathematically bounded evasive maneuvers regardless of ambient illumination. To support future research, we will release the codes to our GitHub repo at https://github.com/oskarnatan/DeepIPCv3.
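To make the cross-modal attention idea concrete, here is a minimal sketch of plain scaled dot-product cross-attention in which tokens from one modality (for instance LiDAR features) query tokens from another (for instance DVS features). It omits learned projections and multiple heads, and it is not DeepIPCv3's actual layer; the names `cross_attention`, `queries`, `keys`, and `values` are assumptions for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all key/value pairs.

    queries: list of d-dim vectors from modality A (hypothetically LiDAR tokens)
    keys, values: lists of vectors from modality B (hypothetically DVS tokens)
    Returns one fused vector per query.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

out = cross_attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]])
```

With two identical keys, the attention weights are equal, so the fused vector is the plain average of the two value vectors.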
RO | Apr 12, 2022 | Code
End-to-end Autonomous Driving with Semantic Depth Cloud Mapping and Multi-agent
Oskar Natan, Jun Miura
Focusing on the task of point-to-point navigation for an autonomous driving vehicle, we propose a novel deep learning model trained in an end-to-end, multi-task manner to perform both perception and control tasks simultaneously. The model is used to drive the ego vehicle safely by following a sequence of routes defined by the global planner. The perception part of the model encodes high-dimensional observation data provided by an RGBD camera while performing semantic segmentation, semantic depth cloud (SDC) mapping, and traffic light state and stop sign prediction. Then, the control part decodes the encoded features along with additional information provided by GPS and speedometer to predict waypoints that come with a latent feature space. Furthermore, two agents are employed to process these outputs and form a control policy that determines the level of steering, throttle, and brake as the final action. The model is evaluated on the CARLA simulator with various scenarios made of normal-adversarial situations and different weather conditions to mimic real-world conditions. In addition, we conduct a comparative study with some recent models to justify the performance in multiple aspects of driving. Moreover, we conduct an ablation study on SDC mapping and multi-agent to understand their roles and behavior. As a result, our model achieves the highest driving score even with fewer parameters and a lower computation load. To support future studies, we share our code at https://github.com/oskarnatan/end-to-end-driving.
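The abstract does not detail how the semantic depth cloud is built. One plausible reading, sketched here under stated assumptions, is that each pixel's depth and predicted class are projected onto a top-down grid using a pinhole camera model. The function `sdc_bev`, the intrinsics `fx` and `cx`, and the grid parameters are hypothetical, and pixel height is ignored for simplicity.

```python
import math

def sdc_bev(depth, labels, fx, cx, grid_size, cell_m):
    """Project per-pixel depth and class labels onto a top-down grid (illustrative).

    depth:  2D list of forward distances in metres (<= 0 means invalid)
    labels: 2D list of class ids, same shape as depth
    fx, cx: assumed pinhole focal length and principal point (pixels)
    Returns a grid_size x grid_size grid of class ids, -1 where empty.
    The ego vehicle sits at the bottom centre; rows further up are further ahead.
    """
    grid = [[-1] * grid_size for _ in range(grid_size)]
    for row_d, row_l in zip(depth, labels):
        for u, (z, cls) in enumerate(zip(row_d, row_l)):
            if z <= 0:
                continue
            x = (u - cx) * z / fx  # lateral offset in metres, right is positive
            col = math.floor(x / cell_m + grid_size / 2)
            row = grid_size - 1 - math.floor(z / cell_m)
            if 0 <= col < grid_size and 0 <= row < grid_size:
                grid[row][col] = cls  # later pixels overwrite earlier ones
    return grid

bev = sdc_bev([[2.0, 2.0, 2.0]], [[1, 2, 3]], fx=1.0, cx=1.0, grid_size=5, cell_m=1.0)
```

In this toy call the three pixels lie 2 m ahead and land in the left, centre, and right columns of the same grid row.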
RO | Jul 20, 2022 | Code
DeepIPC: Deeply Integrated Perception and Control for an Autonomous Vehicle in Real Environments
Oskar Natan, Jun Miura
In this work, we introduce DeepIPC, a novel end-to-end model tailored for autonomous driving, which seamlessly integrates perception and control tasks. Unlike traditional models that handle these tasks separately, DeepIPC innovatively combines a perception module, which processes RGBD images for semantic segmentation and generates bird's eye view (BEV) mappings, with a controller module that utilizes these insights along with GNSS and angular speed measurements to accurately predict navigational waypoints. This integration allows DeepIPC to efficiently translate complex environmental data into actionable driving commands. Our comprehensive evaluation demonstrates DeepIPC's superior performance in terms of drivability and multi-task efficiency across diverse real-world scenarios, setting a new benchmark for end-to-end autonomous driving systems with a leaner model architecture. The experimental results underscore DeepIPC's potential to significantly enhance autonomous vehicular navigation, promising a step forward in the development of autonomous driving technologies. For further insights and replication, we will make our code and datasets available at https://github.com/oskarnatan/DeepIPC.
RO | Jul 13, 2023 | Code
DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle
Oskar Natan, Jun Miura
We present DeepIPCv2, an autonomous driving model that perceives the environment using a LiDAR sensor for more robust drivability, especially when driving under poor illumination conditions where not everything is clearly visible. DeepIPCv2 takes a set of LiDAR point clouds as the main perception input. Since point clouds are not affected by illumination changes, they can provide a clear observation of the surroundings regardless of the condition. This results in better scene understanding and stable features provided by the perception module to support the controller module in estimating navigational control properly. To evaluate its performance, we conduct several tests by deploying the model to predict a set of driving records and perform real automated driving under three different conditions. We also conduct ablation and comparative studies with some recent models to justify its performance. Based on the experimental results, DeepIPCv2 shows robust performance by achieving the best drivability in all driving scenarios. Furthermore, to support future research, we will upload the code and data to https://github.com/oskarnatan/DeepIPCv2.
RO | Aug 13, 2022
Online Refinement of a Scene Recognition Model for Mobile Robots by Observing Human's Interaction with Environments
Shigemichi Matsuzaki, Hiroaki Masuzawa, Jun Miura
This paper describes a method of online refinement of a scene recognition model for robot navigation considering traversable plants, i.e., flexible plant parts that a robot can push aside while moving. In scene recognition systems that consider traversable plants growing out onto the paths, misclassification may cause the robot to get stuck because traversable plants are recognized as obstacles. Yet, misclassification is inevitable in any estimation method. In this work, we propose a framework that allows for refining a semantic segmentation model on the fly during the robot's operation. We introduce a few-shot segmentation based on weight imprinting for online model refinement without fine-tuning. Training data are collected via observation of a human's interaction with the plant parts. We propose a novel robust weight imprinting to mitigate the effect of noise included in the masks generated by the interaction. The proposed method was evaluated through experiments using real-world data and shown to outperform ordinary weight imprinting and to provide results competitive with fine-tuning with model distillation while requiring less computational cost.
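For readers unfamiliar with weight imprinting, the following is a minimal sketch of the ordinary (non-robust) variant applied to pixel features: the classifier weight for a new class is set to the normalized mean of the normalized features under a mask, and pixels are then scored by cosine similarity. The robust variant proposed in the paper is not reproduced here, and the names `imprint_weight` and `cosine_score` are assumptions for illustration.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def imprint_weight(features, mask):
    """Ordinary weight imprinting via masked average pooling (illustrative).

    features: list of per-pixel feature vectors (assumed non-zero)
    mask:     list of 0/1 flags marking pixels of the new class
    Returns a unit-length classifier weight for that class.
    """
    selected = [l2_normalize(f) for f, m in zip(features, mask) if m]
    dim = len(selected[0])
    mean = [sum(f[i] for f in selected) / len(selected) for i in range(dim)]
    return l2_normalize(mean)

def cosine_score(feature, weight):
    """Cosine similarity between a pixel feature and an imprinted weight."""
    return sum(a * b for a, b in zip(l2_normalize(feature), weight))

w = imprint_weight([[2.0, 0.0], [0.0, 3.0], [4.0, 0.0]], [1, 0, 1])
score_same = cosine_score([5.0, 0.0], w)
score_other = cosine_score([0.0, 1.0], w)
```

Because no gradient step is involved, the new class weight is available immediately after the masked features are collected, which is what makes the approach attractive for on-the-fly refinement.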
CV | Mar 2, 2023
Multi-Source Soft Pseudo-Label Learning with Domain Similarity-based Weighting for Semantic Segmentation
Shigemichi Matsuzaki, Hiroaki Masuzawa, Jun Miura
This paper describes a method of domain adaptive training for semantic segmentation using multiple source datasets that are not necessarily relevant to the target dataset. We propose a soft pseudo-label generation method that integrates predicted object probabilities from multiple source models. The prediction of each source model is weighted based on the estimated domain similarity between the source and the target datasets to emphasize the contribution of a model trained on a source that is more similar to the target and to generate reasonable pseudo-labels. We also propose a training method using the soft pseudo-labels that considers their entropy to fully exploit information from the source datasets while suppressing the influence of possibly misclassified pixels. The experiments show comparable or better performance than our previous work and another existing multi-source domain adaptation method, and applicability to a variety of target environments.
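The weighting described above can be sketched for a single pixel as follows. This is a minimal illustration under assumptions: similarity scores are simply normalized to sum to one, and the entropy of the fused label is turned into a per-pixel confidence by linear rescaling. The paper's actual similarity estimate and entropy-based loss may differ, and the names `fuse_soft_label` and `confidence_from_entropy` are hypothetical.

```python
import math

def fuse_soft_label(source_probs, similarities):
    """Similarity-weighted average of per-source class probabilities for one pixel.

    source_probs: one probability vector per source model
    similarities: one non-negative domain-similarity score per source
    """
    total = sum(similarities)
    weights = [s / total for s in similarities]
    n_classes = len(source_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, source_probs))
            for c in range(n_classes)]

def confidence_from_entropy(soft_label):
    """Map entropy to [0, 1]: 1 for a one-hot label, 0 for a uniform one."""
    entropy = -sum(p * math.log(p) for p in soft_label if p > 0.0)
    return 1.0 - entropy / math.log(len(soft_label))

# Source 0 is judged three times more similar to the target than source 1.
label = fuse_soft_label([[0.9, 0.1], [0.5, 0.5]], [3.0, 1.0])
pixel_weight = confidence_from_entropy(label)
```

A training loop could multiply each pixel's loss by `pixel_weight` so that ambiguous pseudo-labels contribute less.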
RO | Oct 27, 2025 | Code
Seq-DeepIPC: Sequential Sensing for End-to-End Control in Legged Robot Navigation
Oskar Natan, Jun Miura
We present Seq-DeepIPC, a sequential end-to-end perception-to-control model for legged robot navigation in real-world environments. Seq-DeepIPC advances intelligent sensing for autonomous legged navigation by tightly integrating multi-modal perception (RGB-D + GNSS) with temporal fusion and control. The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use EfficientNet-B0 as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead computing the bearing angle directly from consecutive GNSS positions. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq-DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq-DeepIPC achieves competitive or better results with reasonable model size; although GNSS-only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq-DeepIPC extends end-to-end navigation beyond wheeled robots to more versatile and temporally-aware systems. To support future research, we will release the code to our GitHub repository at https://github.com/oskarnatan/Seq-DeepIPC.
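The GNSS-only heading idea can be illustrated with the standard initial-bearing formula between two latitude/longitude fixes. This is a minimal sketch rather than the authors' implementation; the function name `bearing_deg` and the convention of degrees clockwise from north are assumptions.

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from fix 1 to fix 2.

    Inputs are in decimal degrees. The result is in degrees clockwise
    from north, in the range [0, 360).
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(x, y)) % 360.0

# Heading while moving due east along the equator.
heading = bearing_deg(0.0, 0.0, 0.0, 1.0)
```

When two consecutive fixes are nearly identical, for example while the robot stands still, the computed bearing is dominated by GNSS noise, so a practical system would likely hold the last heading until the displacement exceeds some threshold.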
RO | Mar 20, 2024
Natural Language as Policies: Reasoning for Coordinate-Level Embodied Control with LLMs
Yusuke Mikami, Andrew Melnik, Jun Miura et al.
We demonstrate experimental results with LLMs that address robotics task planning problems. Recently, LLMs have been applied in robotics task planning, particularly using a code generation approach that converts complex high-level instructions into mid-level policy codes. In contrast, our approach acquires text descriptions of the task and scene objects, then formulates task planning through natural language reasoning, and outputs coordinate level control commands, thus reducing the necessity for intermediate representation code as policies with pre-defined APIs. Our approach is evaluated on a multi-modal prompt simulation benchmark, demonstrating that our prompt engineering experiments with natural language reasoning significantly enhance success rates compared to its absence. Furthermore, our approach illustrates the potential for natural language descriptions to transfer robotics skills from known tasks to previously unseen tasks. The project website: https://natural-language-as-policies.github.io/
RO | Aug 2, 2021
Image-based scene recognition for robot navigation considering traversable plants and its manual annotation-free training
Shigemichi Matsuzaki, Hiroaki Masuzawa, Jun Miura
This paper describes a method of estimating the traversability of plant parts covering a path and navigating through them for mobile robots operating in plant-rich environments. Conventional mobile robots rely on scene recognition methods that consider only the geometric information of the environment. Those methods, therefore, cannot recognize paths as traversable when they are covered by flexible plants. In this paper, we present a novel framework of image-based scene recognition to realize navigation in such plant-rich environments. Our recognition model exploits a semantic segmentation branch for general object classification and a traversability estimation branch for estimating pixel-wise traversability. The semantic segmentation branch is trained using an unsupervised domain adaptation method and the traversability estimation branch is trained with label images generated from the robot's traversal experience during the data acquisition phase, coined traversability masks. The training procedure of the entire model is, therefore, free from manual annotation. In our experiment, we show that the proposed recognition framework is capable of distinguishing traversable plants more accurately than a conventional semantic segmentation with traversable plant and non-traversable plant classes, and an existing image-based traversability estimation method. We also conducted a real-world experiment and confirmed that the robot with the proposed recognition method successfully navigated in plant-rich environments.
RO | Apr 21, 2021
Multi-task Learning with Attention for End-to-end Autonomous Driving
Keishi Ishihara, Anssi Kanervisto, Jun Miura et al.
Autonomous driving systems need to handle complex scenarios such as lane following, avoiding collisions, taking turns, and responding to traffic signals. In recent years, approaches based on end-to-end behavioral cloning have demonstrated remarkable performance in point-to-point navigational scenarios, using a realistic simulator and standard benchmarks. Offline imitation learning is readily available, as it does not require expensive hand annotation or interaction with the target environment, but it is difficult to obtain a reliable system. In addition, existing methods have not specifically addressed learning to react to traffic lights, which are a rare occurrence in the training datasets. Inspired by previous work on multi-task learning and attention modeling, we propose a novel multi-task attention-aware network in the conditional imitation learning (CIL) framework. This improves not only the success rate on standard benchmarks but also the ability to react to traffic lights, which we show with standard benchmarks.
CV | Feb 12, 2021
Multi-source Pseudo-label Learning of Semantic Segmentation for the Scene Recognition of Agricultural Mobile Robots
Shigemichi Matsuzaki, Jun Miura, Hiroaki Masuzawa
This paper describes a novel method of training a semantic segmentation model for scene recognition of agricultural mobile robots, exploiting publicly available datasets of outdoor scenes that are different from the target greenhouse environments. Semantic segmentation models require abundant labels given by tedious manual annotation. One way to work around this is unsupervised domain adaptation (UDA), which transfers knowledge from labeled source datasets to unlabeled target datasets. However, the effectiveness of existing methods is not well studied in adaptation between heterogeneous environments, such as urban scenes and greenhouses. In this paper, we propose a method to train a semantic segmentation model for greenhouse images without manually labeled datasets of greenhouse images. The core of our idea is to use multiple rich image datasets of different environments with segmentation labels to generate pseudo-labels for the target images, to effectively transfer the knowledge from multiple sources and realize precise training of semantic segmentation. Along with the pseudo-label generation, we introduce state-of-the-art methods to deal with noise in the pseudo-labels to further improve the performance. We demonstrate in experiments with multiple greenhouse datasets that our proposed method improves the performance compared to the single-source baselines and an existing approach.
CV | Oct 5, 2019
Early Estimation of User's Intention of Tele-Operation Using Object Affordance and Hand Motion in a Dual First-Person Vision
Motoki Kojima, Jun Miura
This paper describes a method of estimating the intention of a user's motion in a robot tele-operation scenario. One of the issues in tele-operation is latency, which occurs due to various reasons such as a slow robot motion and a narrow communication channel. An effective way of reducing the latency is to estimate the human intention of motions and to move the robot proactively. To enable a reliable early intention estimation, we use both hand motion and object affordances in a dual first-person vision (robot and user) with an HMD. Experimental results in an object pickup scenario show the effectiveness of the method.