CV Jun 1
FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds
Rai Hisada, Kanji Tanaka
This paper proposes ``FlatVPR,'' a novel geometric rectification paradigm that effectively bridges the trade-off between map lightweightness and localization accuracy in visual place recognition (VPR) by enforcing a feature manifold structure where any descriptor between two adjacent anchors $\mathbf{z}_A$ and $\mathbf{z}_B$ can be accurately reconstructed via linear interpolation $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$, where $t \in [0,1]$ denotes the relative position. While state-of-the-art foundation models such as DINOv2-ViT-S/14 provide robust semantic features, their latent manifolds exhibit prominent curvature, projecting uniform linear motion in physical space onto highly non-linear trajectories in the feature space, which hinders reliable reconstruction under sparse anchor conditions. To enable the aforementioned interpolation-based reconstruction, we introduce a residual transformation $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$ to the raw foundation features $\mathbf{z}$, where $\text{Res}(\cdot)$ represents a learnable adapter. Our method explicitly suppresses manifold curvature using a mathematically grounded Pullback Flatness Loss that minimizes the deviation of intermediate features from the linear segment connecting adjacent anchors, thereby minimizing the intrinsic curvature of the manifold. Through this spatial flattening, map construction is formulated within an Expectation-Maximization (EM) framework, decoupled into a continuous M-step for manifold adaptation and a conceptual E-step for optimal anchor selection guidelines. Experiments on the NCLT dataset demonstrate that the application of our adapter leads to significant performance improvements even under extremely sparse anchor conditions with 100m intervals and extreme seasonal changes.
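The interpolation rule $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$ and the idea of penalizing deviation from the segment between adjacent anchors can be sketched as follows. This is only an illustrative sketch: the abstract does not give the exact form of the Pullback Flatness Loss, so the mean squared Euclidean deviation used here is an assumption, and the names `interpolate` and `flatness_loss` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def interpolate(z_a: Vector, z_b: Vector, t: float) -> List[float]:
    """Pseudo-descriptor (1 - t) * z_a + t * z_b for a relative position t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def flatness_loss(z_a: Vector, z_b: Vector,
                  intermediates: Sequence[Vector],
                  ts: Sequence[float]) -> float:
    """Mean squared deviation of intermediate features from the anchor segment.

    ASSUMPTION: squared Euclidean distance is used as the deviation measure;
    the loss in the paper may be defined differently.
    """
    if len(intermediates) != len(ts) or not intermediates:
        raise ValueError("need one t per intermediate feature")
    total = 0.0
    for z, t in zip(intermediates, ts):
        z_hat = interpolate(z_a, z_b, t)
        total += sum((x - y) ** 2 for x, y in zip(z, z_hat))
    return total / len(intermediates)
```

A feature that lies exactly on the segment contributes zero loss, so the value grows only as the trajectory between the two anchors bends away from the straight line.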
CV Jun 19, 2023
PartSLAM: Unsupervised Part-based Scene Modeling for Fast Succinct Map Matching
Shogo Hanada, Kanji Tanaka
In this paper, we explore the challenging 1-to-N map matching problem, which exploits a compact description of map data, to improve the scalability of map matching techniques used by various robot vision tasks. We propose the first method explicitly aimed at fast succinct map matching, which consists only of map-matching subtasks. These subtasks include an offline map-matching attempt to find a compact part-based scene model that effectively explains each map using fewer, larger parts, and an online map-matching attempt to efficiently find correspondences between the part-based maps. Our part-based scene modeling approach is unsupervised and uses common pattern discovery (CPD) between the input and known reference maps. This enables a robot to learn a compact map model without human intervention. We also present a practical implementation that uses the state-of-the-art CPD technique of randomized visual phrases (RVP) with a compact bounding box (BB) based part descriptor, which consists of keypoint and descriptor BBs. The results of our challenging map-matching experiments, which use a publicly available radish dataset, show that the proposed approach achieves successful map matching with a significant speedup and a map description that is tens of times more compact. Although this paper focuses on the standard 2D point-set map and the BB-based part representation, we believe our approach is sufficiently general to be applicable to a broad range of map formats, such as the 3D point cloud map, as well as to general bounding volumes and other compact part representations.
CV Jun 28, 2023
Lifelong Change Detection: Continuous Domain Adaptation for Small Object Change Detection in Every Robot Navigation
Koji Takeda, Kanji Tanaka, Yoshimasa Nakamura
Ground-view change detection, a recently emerging research area in robotics, is ill-posed because of visual uncertainty combined with complex nonlinear perspective projection. To regularize this ill-posedness, the commonly applied supervised learning methods (e.g., CSCD-Net) rely on manually annotated high-quality object-class-specific priors. In this work, we consider general application domains where no manual annotation is available and present a fully self-supervised approach. The approach adopts the powerful and versatile idea that object changes detected during everyday robot navigation can be reused as additional priors to improve future change detection tasks. Furthermore, a robustified framework is implemented and verified experimentally in a new challenging practical application scenario: ground-view small object change detection.
CV Mar 29, 2022
Domain Invariant Siamese Attention Mask for Small Object Change Detection via Everyday Indoor Robot Navigation
Koji Takeda, Kanji Tanaka, Yoshimasa Nakamura
The problem of image change detection via everyday indoor robot navigation is explored from a novel perspective of the self-attention technique. Detecting semantically non-distinctive and visually small changes remains a key challenge in the robotics community. Intuitively, these small non-distinctive changes may be better handled by the recent paradigm of the attention mechanism, which is the basic idea of this work. However, existing self-attention models incur a significant retraining cost per domain, so they are not directly applicable to robotics applications. We propose a new self-attention technique capable of unsupervised on-the-fly domain adaptation, which introduces an attention mask into the intermediate layer of an image change detection model, without modifying the input and output layers of the model. Experiments, in which an indoor robot aims to detect visually small changes in everyday navigation, demonstrate that our attention technique significantly boosts the state-of-the-art image change detection model.
RO Apr 22, 2022
Active Domain-Invariant Self-Localization Using Ego-Centric and World-Centric Maps
Kanya Kurauchi, Kanji Tanaka, Ryogo Yamamoto et al.
The training of a next-best-view (NBV) planner for visual place recognition (VPR) is a fundamentally important task in autonomous robot navigation, for which a typical approach is the use of visual experiences that are collected in the target domain as training data. However, the collection of a wide variety of visual experiences in everyday navigation is costly and prohibitive for real-time robotic applications. We address this issue by employing a novel {\it domain-invariant} NBV planner. A standard VPR subsystem based on a convolutional neural network (CNN) is assumed to be available, and its domain-invariant state recognition ability is proposed to be transferred to train the domain-invariant NBV planner. Specifically, we divide the visual cues that are available from the CNN model into two types: the output layer cue (OLC) and intermediate layer cue (ILC). The OLC is available at the output layer of the CNN model and aims to estimate the state of the robot (e.g., the robot viewpoint) with respect to the world-centric view coordinate system. The ILC is available within the middle layers of the CNN model as a high-level description of the visual content (e.g., a saliency image) with respect to the ego-centric view. In our framework, the ILC and OLC are mapped to a state vector and subsequently used to train a multiview NBV planner via deep reinforcement learning. Experiments using the public NCLT dataset validate the effectiveness of the proposed method.
CV Mar 26, 2022
Exploring Self-Attention for Visual Intersection Classification
Haruki Nakata, Kanji Tanaka, Koji Takeda
In robot vision, self-attention has recently emerged as a technique for capturing non-local contexts. In this study, we introduced a self-attention mechanism into the intersection recognition system as a method to capture the non-local contexts behind the scenes. An intersection classification system comprises two distinctive modules: (a) a first-person vision (FPV) module, which uses a short egocentric view sequence as the intersection is passed, and (b) a third-person vision (TPV) module, which uses a single view immediately before entering the intersection. The self-attention mechanism is effective in the TPV module because most parts of the local pattern (e.g., road edges, buildings, and sky) are similar to each other, and thus the use of a non-local context (e.g., the angle between two diagonal corners around an intersection) would be effective. This study makes three major contributions. First, we proposed a self-attention-based approach for intersection classification using TPVs. Second, we presented a practical system in which a self-attention-based TPV module is combined with an FPV module to improve the overall recognition performance. Finally, experiments using the public KITTI dataset show that the above self-attention-based system outperforms conventional recognition based on local patterns and recognition based on convolution operations.
CV Aug 3, 2022
Compressive Self-localization Using Relative Attribute Embedding
Ryogo Yamamoto, Kanji Tanaka
This paper explores the use of image embeddings based on relative attributes (e.g., beautiful, safe, convenient) in visual place recognition, as a domain-adaptive compact image descriptor that is orthogonal to the typical approach of image embeddings based on absolute attributes (e.g., color, shape, texture).
CV Oct 24, 2023
Cross-view Self-localization from Synthesized Scene-graphs
Ryogo Yamamoto, Kanji Tanaka
Cross-view self-localization is a challenging scenario of visual place recognition in which database images are provided from sparse viewpoints. Recently, an approach for synthesizing database images from unseen viewpoints using NeRF (Neural Radiance Fields) technology has emerged with impressive performance. However, synthesized images provided by these techniques are often of lower quality than the original images, and furthermore they significantly increase the storage cost of the database. In this study, we explore a new hybrid scene model that combines the advantages of view-invariant appearance features computed from raw images and view-dependent spatial-semantic features computed from synthesized images. These two types of features are then fused into scene graphs, and compressively learned and recognized by a graph neural network. The effectiveness of the proposed method was verified using a novel cross-view self-localization dataset with many unseen views generated using a photorealistic Habitat simulator.
RO Sep 23, 2024
CON: Continual Object Navigation via Data-Free Inter-Agent Knowledge Transfer in Unseen and Unfamiliar Places
Kouki Terashima, Daiki Iwata, Kanji Tanaka
This work explores the potential of brief inter-agent knowledge transfer (KT) to enhance the robotic object goal navigation (ON) in unseen and unfamiliar environments. Drawing on the analogy of human travelers acquiring local knowledge, we propose a framework in which a traveler robot (student) communicates with local robots (teachers) to obtain ON knowledge through minimal interactions. We frame this process as a data-free continual learning (CL) challenge, aiming to transfer knowledge from a black-box model (teacher) to a new model (student). In contrast to approaches like zero-shot ON using large language models (LLMs), which utilize inherently communication-friendly natural language for knowledge representation, the other two major ON approaches -- frontier-driven methods using object feature maps and learning-based ON using neural state-action maps -- present complex challenges where data-free KT remains largely uncharted. To address this gap, we propose a lightweight, plug-and-play KT module targeting non-cooperative black-box teachers in open-world settings. Using the universal assumption that every teacher robot has vision and mobility capabilities, we define state-action history as the primary knowledge base. Our formulation leads to the development of a query-based occupancy map that dynamically represents target object locations, serving as an effective and communication-friendly knowledge representation. We validate the effectiveness of our method through experiments conducted in the Habitat environment.
RO Sep 30, 2023
Walking = Traversable? : Traversability Prediction via Multiple Human Object Tracking under Occlusion
Jonathan Tay Yu Liang, Kanji Tanaka
The emerging ``Floor plan from human trails (PfH)'' technique has great potential for improving indoor robot navigation by predicting the traversability of occluded floors. This study presents an innovative approach that replaces first-person-view sensors with a third-person-view monocular camera mounted on the observer robot. This approach can gather measurements from multiple humans, expanding its range of applications. The key idea is to use two types of trackers, SLAM and MOT, to monitor stationary objects and moving humans and assess their interactions. This method achieves stable predictions of traversability even in challenging visual scenarios, such as occlusions, nonlinear perspectives, depth uncertainty, and intersections involving multiple humans. Additionally, we extend map quality metrics to apply to traversability maps, facilitating future research. We validate our proposed method through fusion and comparison with established techniques.
RO Apr 1
A Dual-Stream Transformer Architecture for Illumination-Invariant TIR-LiDAR Person Tracking
Yuki Minase, Kanji Tanaka
Robust person tracking is a critical capability for autonomous mobile robots operating in diverse and unpredictable environments. While RGB-D tracking has shown high precision, its performance severely degrades under challenging illumination conditions, such as total darkness or intense backlighting. To achieve all-weather robustness, this paper proposes a novel Thermal-Infrared and Depth (TIR-D) tracking architecture that leverages the standard sensor suite of SLAM-capable robots, namely LiDAR and TIR cameras. A major challenge in TIR-D tracking is the scarcity of annotated multi-modal datasets. To address this, we introduce a sequential knowledge transfer strategy that evolves structural priors from a large-scale thermal-trained model into the TIR-D domain. By employing a differential learning rate strategy -- referred to as ``Fine-grained Differential Learning Rate Strategy'' -- we effectively preserve pre-trained feature extraction capabilities while enabling rapid adaptation to geometric depth cues. Experimental results demonstrate that our proposed TIR-D tracker achieves superior performance, with an Average Overlap (AO) of 0.700 and a Success Rate (SR) of 58.7\%, significantly outperforming conventional RGB-transfer and single-modality baselines. Our approach provides a practical and resource-efficient solution for robust human-following in all-weather robotics applications.
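For readers unfamiliar with the two reported metrics, the sketch below shows one common way Average Overlap (AO) and Success Rate (SR) are computed from per-frame bounding boxes. It is a minimal illustration, not the authors' evaluation code: the `(x, y, w, h)` box format, the overlap threshold of 0.5, and all function names are assumptions, since the abstract does not state the evaluation protocol.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # ASSUMED format: (x, y, width, height)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """AO: mean IoU between predicted and ground-truth boxes over all frames."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(overlaps) / len(overlaps)


def success_rate(preds: Sequence[Box], gts: Sequence[Box],
                 threshold: float = 0.5) -> float:
    """SR: fraction of frames whose IoU exceeds the threshold (0.5 is assumed)."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(1 for o in overlaps if o > threshold) / len(overlaps)
```

Under these definitions AO rewards tight boxes on every frame, while SR only counts whether each frame clears the threshold, which is why the two numbers can differ noticeably.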
LG Mar 13, 2024
Training Self-localization Models for Unseen Unfamiliar Places via Teacher-to-Student Data-Free Knowledge Transfer
Kenta Tsukahara, Kanji Tanaka, Daiki Iwata
A typical assumption in state-of-the-art self-localization models is that an annotated training dataset is available in the target workspace. However, this does not always hold when a robot travels in a general open-world. This study introduces a novel training scheme for open-world distributed robot systems. In our scheme, a robot ("student") can ask the other robots it meets at unfamiliar places ("teachers") for guidance. Specifically, a pseudo-training dataset is reconstructed from the teacher model and thereafter used for continual learning of the student model. Unlike typical knowledge transfer schemes, our scheme introduces only minimal assumptions on the teacher model, such that it can handle various types of open-set teachers, including uncooperative, untrainable (e.g., image retrieval engines), and blackbox teachers (i.e., data privacy). Rather than relying on the availability of private data of teachers as in existing methods, we propose to exploit an assumption that holds universally in self-localization tasks: "The teacher model is a self-localization system" and to reuse the self-localization system of a teacher as a sole accessible communication channel. We particularly focus on designing an excellent student/questioner whose interactions with teachers can yield effective question-and-answer sequences that can be used as pseudo-training datasets for the student self-localization model. When applied to a generic recursive knowledge distillation scenario, our approach exhibited stable and consistent performance improvement.
RO Mar 26, 2025
LGR: LLM-Guided Ranking of Frontiers for Object Goal Navigation
Mitsuaki Uno, Kanji Tanaka, Daiki Iwata et al.
Object Goal Navigation (OGN) is a fundamental task for robots and AI, with key applications such as mobile robot image databases (MRID). In particular, mapless OGN is essential in scenarios involving unknown or dynamic environments. This study aims to enhance recent modular mapless OGN systems by leveraging the commonsense reasoning capabilities of large language models (LLMs). Specifically, we address the challenge of determining the visiting order in frontier-based exploration by framing it as a frontier ranking problem. Our approach is grounded in recent findings that, while LLMs cannot determine the absolute value of a frontier, they excel at evaluating the relative value between multiple frontiers viewed within a single image using the view image as context. We dynamically manage the frontier list by adding and removing elements, using an LLM as a ranking model. The ranking results are represented as reciprocal rank vectors, which are ideal for multi-view, multi-query information fusion. We validate the effectiveness of our method through evaluations in Habitat-Sim.
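To illustrate why reciprocal rank vectors are convenient for combining several rankings, the sketch below converts each ranking of frontiers into a vector of 1/rank scores and sums the vectors. This is an assumption-laden illustration: the abstract does not specify the fusion rule, so plain summation is assumed here, and the frontier identifiers and function names are hypothetical.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_vector(ranking: Sequence[str],
                           candidates: Sequence[str]) -> Dict[str, float]:
    """Map each candidate to 1/rank (rank 1 is best); unranked candidates get 0."""
    position = {frontier: i + 1 for i, frontier in enumerate(ranking)}
    return {c: (1.0 / position[c] if c in position else 0.0) for c in candidates}


def fuse_rankings(rankings: Sequence[Sequence[str]],
                  candidates: Sequence[str]) -> List[str]:
    """Fuse several rankings by summing reciprocal rank vectors.

    ASSUMPTION: summation is used as the fusion rule for illustration only.
    """
    fused = {c: 0.0 for c in candidates}
    for ranking in rankings:
        for c, score in reciprocal_rank_vector(ranking, candidates).items():
            fused[c] += score
    # sorted() is stable, so ties keep the order given in `candidates`.
    return sorted(candidates, key=lambda c: -fused[c])
```

Because each view only needs to rank the frontiers it can see, a frontier missing from one ranking simply contributes zero from that view, and frontiers can be added to or removed from `candidates` between queries.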
RO Mar 17, 2025
Dynamic-Dark SLAM: RGB-Thermal Cooperative Robot Vision Strategy for Multi-Person Tracking in Both Well-Lit and Low-Light Scenes
Tatsuro Sakai, Kanji Tanaka, Yuki Minase et al.
In robot vision, thermal cameras hold great potential for recognizing humans even in complete darkness. However, their application to multi-person tracking (MPT) has been limited due to data scarcity and the inherent difficulty of distinguishing individuals. In this study, we propose a cooperative MPT system that utilizes co-located RGB and thermal cameras, where pseudo-annotations (bounding boxes and person IDs) are used to train both RGB and thermal trackers. Evaluation experiments demonstrate that the thermal tracker performs robustly in both bright and dark environments. Moreover, the results suggest that a tracker-switching strategy -- guided by a binary brightness classifier -- is more effective for information integration than a tracker-fusion approach. As an application example, we present an image change pattern recognition (ICPR) method, the ``human-as-landmark,'' which combines two key properties: the thermal recognizability of humans in dark environments and the rich landmark characteristics -- appearance, geometry, and semantics -- of static objects (occluders). Whereas conventional SLAM focuses on mapping static landmarks in well-lit environments, the present study takes a first step toward a new Human-Only SLAM paradigm, ``Dynamic-Dark SLAM,'' which aims to map even dynamic landmarks in complete darkness. Additionally, this study demonstrates that knowledge transfer between thermal and depth modalities enables reliable person tracking using low-resolution 3D LiDAR data without RGB input, contributing an important advance toward cross-robot SLAM systems.
CV Mar 9
From Reactive to Map-Based AI: Tuned Local LLMs for Semantic Zone Inference in Object-Goal Navigation
Yudai Noda, Kanji Tanaka
Object-Goal Navigation (ObjectNav) requires an agent to find and navigate to a target object category in unknown environments. While recent Large Language Model (LLM)-based agents exhibit zero-shot reasoning, they often rely on a "reactive" paradigm that lacks explicit spatial memory, leading to redundant exploration and myopic behaviors. To address these limitations, we propose a transition from reactive AI to "Map-Based AI" by integrating LLM-based semantic inference with a hybrid topological-grid mapping system. Our framework employs a fine-tuned Llama-2 model via Low-Rank Adaptation (LoRA) to infer semantic zone categories and target existence probabilities from verbalized object observations. In this study, a "zone" is defined as a functional area described by the set of observed objects, providing crucial semantic co-occurrence cues for finding the target. This semantic information is integrated into a topological graph, enabling the agent to prioritize high-probability areas and perform systematic exploration via Traveling Salesman Problem (TSP) optimization. Evaluations in the AI2-THOR simulator demonstrate that our approach significantly outperforms traditional frontier exploration and reactive LLM baselines, achieving a superior Success Rate (SR) and Success weighted by Path Length (SPL).
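The two navigation metrics named above can be computed as in the following sketch. It uses the standard definition of Success weighted by Path Length, in which each successful episode contributes the ratio of the shortest path length to the larger of the actual and shortest path lengths. The `Episode` record and function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Hypothetical per-episode record used only for this illustration."""
    success: bool          # whether the agent reached the target object
    shortest_path: float   # shortest path length from start to goal
    agent_path: float      # length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes that ended in success."""
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: mean over episodes of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)
```

SPL never exceeds SR, and the gap between them indicates how much redundant exploration the agent performed on the episodes it solved.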
CV Jun 17, 2024
DRIP: Discriminative Rotation-Invariant Pole Landmark Descriptor for 3D LiDAR Localization
Dingrui Li, Dedi Guo, Kanji Tanaka
In 3D LiDAR-based robot self-localization, pole-like landmarks are gaining popularity as lightweight and discriminative landmarks. This work introduces a novel approach called "discriminative rotation-invariant poles," which enhances the discriminability of pole-like landmarks while maintaining their lightweight nature. Unlike conventional methods that model a pole landmark as a 3D line segment perpendicular to the ground, we propose a simple yet powerful approach that includes not only the line segment's main body but also its surrounding local region of interest (ROI) as part of the pole landmark. Specifically, we describe the appearance, geometry, and semantic features within this ROI to improve the discriminability of the pole landmark. Since such pole landmarks are no longer rotation-invariant, we introduce a novel rotation-invariant convolutional neural network that automatically and efficiently extracts rotation-invariant features from input point clouds for recognition. Furthermore, we train a pole dictionary through unsupervised learning and use it to compress poles into compact pole words, thereby significantly reducing real-time costs while maintaining optimal self-localization performance. Monte Carlo localization experiments using the publicly available NCLT dataset demonstrate that the proposed method improves a state-of-the-art pole-based localization framework.
CV May 10, 2024
Zero-shot Degree of Ill-posedness Estimation for Active Small Object Change Detection
Koji Takeda, Kanji Tanaka, Yoshimasa Nakamura et al.
In everyday indoor navigation, robots often need to detect non-distinctive small-change objects (e.g., stationery, lost items, and junk) to maintain domain knowledge. This is most relevant to ground-view change detection (GVCD), a recently emerging research area in the field of computer vision. However, existing techniques rely on high-quality class-specific object priors to regularize the change detector model and thus cannot be applied to semantically non-distinctive small objects. To address this ill-posedness, in this study, we explore the concept of degree-of-ill-posedness (DoI) from the new perspective of GVCD, aiming to improve both passive and active vision. This novel DoI problem is highly domain-dependent, and manually collecting fine-grained annotated training data is expensive. To regularize this problem, we apply the concept of self-supervised learning to achieve an efficient DoI estimation scheme and investigate its generalization to diverse datasets. Specifically, we tackle the challenging issue of obtaining self-supervision cues for semantically non-distinctive unseen small objects and show that novel "oversegmentation cues" from open vocabulary semantic segmentation can be effectively exploited. When applied to diverse real-world datasets, the proposed DoI model boosts state-of-the-art change detection models with stable and consistent improvements.
RO Dec 26, 2023
Recursive Distillation for Open-Set Distributed Robot Localization
Kenta Tsukahara, Kanji Tanaka
A typical assumption in state-of-the-art self-localization models is that an annotated training dataset is available for the target workspace. However, this is not necessarily true when a robot travels around the general open world. This work introduces a novel training scheme for open-world distributed robot systems. In our scheme, a robot (``student'') can ask the other robots it meets at unfamiliar places (``teachers'') for guidance. Specifically, a pseudo-training dataset is reconstructed from the teacher model and then used for continual learning of the student model under a domain-, class-, and vocabulary-incremental setup. Unlike typical knowledge transfer schemes, our scheme introduces only minimal assumptions on the teacher model, so that it can handle various types of open-set teachers, including uncooperative, untrainable (e.g., image retrieval engines), and black-box teachers (i.e., data privacy). In this paper, we investigate a ranking function as an instance of such generic models, using a challenging data-free recursive distillation scenario, where a student once trained can recursively join the next-generation open teacher set.
CV May 10, 2023
A Multi-modal Approach to Single-modal Visual Place Classification
Tomoya Iwasaki, Kanji Tanaka, Kenta Tsukahara
Visual place classification from a first-person-view monocular RGB image is a fundamental problem in long-term robot navigation. A difficulty arises from the fact that RGB image classifiers are often vulnerable to spatial and appearance changes and degrade due to domain shifts, such as seasonal, weather, and lighting differences. To address this issue, multi-sensor fusion approaches combining RGB and depth (D) (e.g., LIDAR, radar, stereo) have gained popularity in recent years. Inspired by these efforts in multimodal RGB-D fusion, we explore the use of pseudo-depth measurements from recently developed techniques of ``domain-invariant'' monocular depth estimation as an additional pseudo-depth modality, by reformulating the single-modal RGB image classification task as a pseudo multi-modal RGB-D classification problem. Specifically, a practical, fully self-supervised framework for training, appropriately processing, fusing, and classifying these two modalities, RGB and pseudo-D, is described. Experiments on challenging cross-domain scenarios using the public NCLT dataset validate the effectiveness of the proposed framework.
CV May 10, 2023
Active Semantic Localization with Graph Neural Embedding
Mitsuki Yoshida, Kanji Tanaka, Ryogo Yamamoto et al.
Semantic localization, i.e., robot self-localization with semantic image modality, is critical in recently emerging embodied AI applications (e.g., point-goal navigation, object-goal navigation, vision language navigation) and topological mapping applications (e.g., graph neural SLAM, ego-centric topological map). However, most existing works on semantic localization focus on passive vision tasks without viewpoint planning, or rely on additional rich modalities (e.g., depth measurements). Thus, the problem is largely unsolved. In this work, we explore a lightweight, entirely CPU-based, domain-adaptive semantic localization framework, called graph neural localizer. Our approach is inspired by two recently emerging technologies: (1) Scene graph, which combines the viewpoint- and appearance- invariance of local and global features; (2) Graph neural network, which enables direct learning/recognition of graph data (i.e., non-vector data). Specifically, a graph convolutional neural network is first trained as a scene graph classifier for passive vision, and then its knowledge is transferred to a reinforcement-learning planner for active vision. Experiments on two scenarios, self-supervised learning and unsupervised domain adaptation, using a photo-realistic Habitat simulator validate the effectiveness of the proposed method.
CV Sep 9, 2021
Open-World Distributed Robot Self-Localization with Transferable Visual Vocabulary and Both Absolute and Relative Features
Mitsuki Yoshida, Ryogo Yamamoto, Daiki Iwata et al.
Visual robot self-localization is a fundamental problem in visual robot navigation and has been studied across various problem settings, including monocular and sequential localization. However, many existing studies focus primarily on single-robot scenarios, with limited exploration into general settings involving diverse robots connected through wireless networks with constrained communication capacities, such as open-world distributed robot systems. In particular, issues related to the transfer and sharing of key knowledge, such as visual descriptions and visual vocabulary, between robots have been largely neglected. This work introduces a new self-localization framework designed for open-world distributed robot systems that maintains state-of-the-art performance while offering two key advantages: (1) it employs an unsupervised visual vocabulary model that maps to multimodal, lightweight, and transferable visual features, and (2) the visual vocabulary itself is a lightweight and communication-friendly model. Although the primary focus is on encoding monocular view images, the framework can be easily extended to sequential localization applications. By utilizing complementary similarity-preserving features -- both absolute and relative -- the framework meets the requirements for being unsupervised, multimodal, lightweight, and transferable. All features are learned and recognized using a lightweight graph neural network and scene graph. The effectiveness of the proposed method is validated in both passive and active self-localization scenarios.
CV Apr 9, 2021
TaylorMade VDD: Domain-adaptive Visual Defect Detector for High-mix Low-volume Production of Non-convex Cylindrical Metal Objects
Kyosuke Tashiro, Koji Takeda, Kanji Tanaka et al.
Visual defect detection (VDD) for high-mix low-volume production of non-convex metal objects, such as high-pressure cylindrical piping joint parts (VDD-HPPPs), is challenging because subtle differences in domain (e.g., metal objects, imaging device, viewpoints, lighting) significantly affect the specular reflection characteristics of individual metal object types. In this paper, we address this issue by introducing a tailor-made VDD framework that can be automatically adapted to a new domain. Specifically, we formulate this adaptation task as the problem of network architecture search (NAS) on a deep object-detection network, in which the network architecture is searched via reinforcement learning. We demonstrate the effectiveness of the proposed framework using the VDD-HPPPs task as a factory case study. Experimental results show that the proposed method achieves higher burr detection accuracy than the baseline method on data with different training/test domains for the non-convex HPPPs, which are particularly affected by domain shifts.
CV Feb 23, 2021
Domain-invariant NBV Planner for Active Cross-domain Self-localization
Kanji Tanaka
Pole-like landmarks have received increasing attention as a domain-invariant visual cue for visual robot self-localization across domains (e.g., seasons, times of day, weather conditions). However, self-localization using pole-like landmarks can be ill-posed for a passive observer, as many viewpoints may not provide any view of a pole-like landmark. To alleviate this problem, we consider an active observer and explore a novel "domain-invariant" next-best-view (NBV) planner that attains consistent performance over different domains (i.e., maintenance-free), without requiring the expensive task of training data collection and retraining. In our approach, a novel multi-encoder deep convolutional neural network detects domain-invariant pole-like landmarks, which are then used as the sole input to a domain-invariant NBV planner based on model-free deep reinforcement learning. Further, we develop a practical system for active self-localization using sparse invariant landmarks and dense discriminative landmarks. In experiments, we demonstrate that the proposed method is effective both in efficient landmark detection and in discriminative self-localization.
CV Nov 1, 2020
Dark Reciprocal-Rank: Boosting Graph-Convolutional Self-Localization Network via Teacher-to-student Knowledge Transfer
Koji Takeda, Kanji Tanaka
In visual robot self-localization, graph-based scene representation and matching have recently attracted research interest as robust and discriminative methods for self-localization. Although effective, their computational and storage costs do not scale well to large-size environments. To alleviate this problem, we formulate self-localization as a graph classification problem and attempt to use the graph convolutional neural network (GCN) as a graph classification engine. A straightforward approach is to use the visual feature descriptors employed by state-of-the-art self-localization systems directly as graph node features. However, their superior performance in the original self-localization system may not necessarily be replicated in GCN-based self-localization. To address this issue, we introduce a novel teacher-to-student knowledge-transfer scheme based on rank matching, in which the reciprocal-rank vector output by an off-the-shelf state-of-the-art teacher self-localization model is used as the dark knowledge to transfer. Experiments indicate that the proposed graph-convolutional self-localization network can significantly outperform state-of-the-art self-localization systems, as well as the teacher classifier.
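As an illustration of the reciprocal-rank vector mentioned in the abstract above, the following minimal sketch converts a teacher's per-place similarity scores into reciprocal ranks. The abstract does not specify how the vector is built, so this assumes the standard definition (1/rank, with rank 1 for the best-scoring place); the function name `reciprocal_rank_vector` and the example scores are hypothetical.

```python
def reciprocal_rank_vector(scores):
    """Map teacher similarity scores (higher is better) to reciprocal ranks.

    Assumption: rank 1 goes to the highest score, and the entry for each
    place class is 1 / rank. This is the textbook definition, not
    necessarily the exact form used in the paper.
    """
    # Sort class indices from the best score to the worst one.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rr = [0.0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        rr[idx] = 1.0 / rank
    return rr


# Hypothetical teacher scores for three place classes.
teacher_scores = [0.1, 0.9, 0.5]
print(reciprocal_rank_vector(teacher_scores))
```

A student network could then be trained to regress or match such a vector instead of a one-hot place label, which is one plausible reading of "rank matching".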
CV Jan 22, 2019
Use of First and Third Person Views for Deep Intersection Classification
Koji Takeda, Kanji Tanaka
We explore the problem of intersection classification using monocular on-board passive vision, with the goal of classifying traffic scenes with respect to road topology. We divide the existing approaches into two broad categories according to the type of input data: (a) first person vision (FPV) approaches, which use an egocentric view sequence as the intersection is passed; and (b) third person vision (TPV) approaches, which use a single view immediately before entering the intersection. The FPV and TPV approaches each have advantages and disadvantages. Therefore, we aim to combine them into a unified deep learning framework. Experimental results show that the proposed FPV-TPV scheme outperforms previous methods and only requires minimal FPV/TPV measurements.
CV Sep 16, 2017
Long-Term Ensemble Learning of Visual Place Classifiers
Xiaoxiao Fei, Kanji Tanaka, Yichu Fang et al.
This paper addresses the problem of cross-season visual place classification (VPC) from a novel perspective of long-term map learning. Our goal is to enable efficient transfer learning from one season to the next, at a small constant cost, and without wasting the robot's available long-term memory by memorizing very large amounts of training data. To realize a good tradeoff between generalization and specialization abilities, we employ an ensemble of deep convolutional neural network (DCN) classifiers and consider the task of scheduling (when and which classifiers to retrain), given a previous season's DCN classifiers as the sole prior knowledge. We present a unified framework for retraining scheduling and discuss practical implementation strategies. Furthermore, we address the task of partitioning a robot's workspace into places to define place classes in an unsupervised manner, rather than using uniform partitioning, so as to maximize VPC performance. Experiments using the publicly available NCLT dataset revealed that retraining scheduling of a DCN classifier ensemble is crucial and that performance is significantly increased by using planned scheduling.
RO Sep 8, 2016
Deformable Map Matching for Uncertain Loop-Less Maps
Kanji Tanaka
In the classical context of robotic mapping and localization, map matching is typically defined as the task of finding a rigid transformation (i.e., 3DOF rotation/translation on the 2D moving plane) that aligns the query and reference maps built by mobile robots. This definition is valid in loop-rich trajectories that enable a mapper robot to close many loops, for which precise maps can be assumed. The same cannot be said about the newly emerging autonomous navigation and driving systems, which typically operate in loop-less trajectories that have no large loop (e.g., straight paths). In this paper, we propose a solution that overcomes this limitation by merging the two maps. Our study is motivated by the observation that even when there is no large loop in either the query or reference map, many loops can often be obtained in the merged map. We add two new aspects to map matching: (1) image retrieval with discriminative deep convolutional neural network (DCNN) features, which efficiently generates a small number of good initial alignment hypotheses; and (2) map merge, which jointly deforms the two maps to minimize differences in shape between them. To realize practical computation time, we also present a preemption scheme that avoids excessive evaluation of useless map-matching hypotheses. To verify our approach experimentally, we created a novel collection of uncertain loop-less maps by utilizing the recently published North Campus Long-Term (NCLT) dataset and its ground-truth GPS data. The results obtained using these map collections confirm that our approach improves on previous map-matching approaches.
CV Aug 6, 2016
Multi-Model Hypothesize-and-Verify Approach for Incremental Loop Closure Verification
Kanji Tanaka
Loop closure detection, which is the task of identifying locations revisited by a robot in a sequence of odometry and perceptual observations, is typically formulated as a visual place recognition (VPR) task. However, even state-of-the-art VPR techniques generate a considerable number of false positives as a result of confusing visual features and perceptual aliasing. In this paper, we propose a robust incremental framework for loop closure detection, termed incremental loop closure verification. Our approach reformulates the problem of loop closure detection as an instance of a multi-model hypothesize-and-verify framework, in which multiple loop closure hypotheses are generated and verified in terms of the consistency between loop closure hypotheses and VPR constraints at multiple viewpoints along the robot's trajectory. Furthermore, we consider the general incremental setting of loop closure detection, in which the system must update both the set of VPR constraints and that of loop closure hypotheses when new constraints or hypotheses arrive during robot navigation. Experimental results using a stereo SLAM system with DCNN features and visual odometry validate the effectiveness of the proposed approach.
CV Aug 6, 2016
Compressive Change Retrieval for Moving Object Detection
Tomoya Murase, Kanji Tanaka
Change detection, or anomaly detection, from street-view images acquired by an autonomous robot at multiple different times is a major problem in robotic mapping and autonomous driving. Many existing approaches formulate this problem as an image comparison task that operates on a given pair of query and reference images. Unfortunately, providing relevant reference images is not straightforward. In this paper, we propose a novel formulation for change detection, termed compressive change retrieval, which can operate on a query image and similar reference images retrieved from the web. Compared to previous formulations, there are two sources of difficulty. First, the retrieved reference images may frequently include non-relevant images, because even state-of-the-art place-recognition techniques suffer from retrieval noise. Second, image comparison needs to be conducted in a compressed domain to minimize the storage cost of large collections of street-view images. To address these issues, we also present a practical change detection algorithm that uses a compressed bag-of-words (BoW) image representation as a scalable solution. The results of experiments conducted on a practical change detection task, "moving object detection (MOD)," using the publicly available Malaga dataset validate the effectiveness of the proposed approach.
CV Sep 25, 2015
Self-localization Using Visual Experience Across Domains
Taisho Tsukamoto, Kanji Tanaka
In this study, we aim to solve the single-view robot self-localization problem by using visual experience across domains. Although the bag-of-words method constitutes a popular approach to single-view localization, it fails badly when its visual vocabulary is learned and tested in different domains. Further, we are interested in using a cross-domain setting, in which the visual vocabulary is learned in different seasons and routes from the input query/database scenes. Our strategy is to mine a cross-domain visual experience, a library of raw visual images collected in different domains, to discover the relevant visual patterns that effectively explain the input scene, and use them for scene retrieval. In particular, we show that the appearance and the pose of the mined visual patterns of a query scene can be efficiently and discriminatively matched against those of the database scenes by employing image-to-class distance and spatial pyramid matching. Experimental results obtained using a novel cross-domain dataset show that our system achieves promising results despite our visual vocabulary being learned and tested in different domains.
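To make the image-to-class distance mentioned in the abstract above concrete, here is a minimal sketch under the common nearest-neighbor formulation: each query descriptor is matched to its closest descriptor in a candidate scene, and the squared distances are summed. The abstract does not give the exact formulation, so this definition, the function names, and the toy 2-D descriptors are all assumptions; spatial pyramid matching is omitted.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def image_to_class_distance(query_descs, class_descs):
    """Sum, over query descriptors, of the distance to the nearest
    descriptor belonging to the candidate class (here: a database scene)."""
    return sum(min(sq_dist(q, c) for c in class_descs) for q in query_descs)


def retrieve(query_descs, database):
    """Return the name of the database scene with the smallest distance."""
    return min(
        database,
        key=lambda name: image_to_class_distance(query_descs, database[name]),
    )


# Hypothetical toy database: scene name -> list of 2-D descriptors.
database = {
    "scene_a": [(0, 0), (1, 1)],
    "scene_b": [(5, 5), (6, 6)],
}
query = [(0, 0), (2, 2)]
print(retrieve(query, database))
```

Because the distance is computed from the query's descriptors to a whole scene rather than between quantized histograms, it avoids the vocabulary quantization step that the abstract identifies as fragile across domains.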
CV Sep 25, 2015
Discriminative Map Retrieval Using View-Dependent Map Descriptor
Enfu Liu, Kanji Tanaka
Map retrieval, the problem of similarity search over a large collection of 2D pointset maps previously built by mobile robots, is crucial for autonomous navigation in indoor and outdoor environments. Bag-of-words (BoW) methods constitute a popular approach to map retrieval; however, these methods have extremely limited descriptive ability because they ignore the spatial layout information of the local features. The main contribution of this paper is an extension of the bag-of-words map retrieval method to enable the use of spatial information from local features. Our strategy is to explicitly model a unique viewpoint of an input local map; the pose of the local feature is defined with respect to this unique viewpoint, and can be viewed as an additional invariant feature for discriminative map retrieval. Specifically, we wish to determine a unique viewpoint that is invariant to moving objects, clutter, occlusions, and actual viewpoints. Hence, we perform scene parsing to analyze the scene structure, and consider the "center" of the scene structure to be the unique viewpoint. Our scene parsing is based on a Manhattan world grammar that imposes a quasi-Manhattan world constraint to enable the robust detection of a scene structure that is invariant to clutter and moving objects. Experimental results using the publicly available Radish dataset validate the efficacy of the proposed approach.
CV Sep 25, 2015
Incremental Loop Closure Verification by Guided Sampling
Kanji Tanaka
Loop closure detection, the task of identifying locations revisited by a robot in a sequence of odometry and perceptual observations, is typically formulated as a combination of two subtasks: (1) bag-of-words image retrieval and (2) post-verification using RANSAC geometric verification. The main contribution of this study is a novel post-verification framework that achieves a good precision-recall trade-off in loop closure detection. This study is motivated by the fact that not all loop closure hypotheses are equally plausible (e.g., owing to mutual consistency between loop closure constraints) and that if we have evidence that one hypothesis is more plausible than the others, then it should be verified more frequently. We demonstrate that the problem of loop closure detection can be viewed as an instance of a multi-model hypothesize-and-verify framework, and we build guided sampling strategies on this framework in which loop closures proposed using image retrieval are verified in a planned order (rather than in a conventional uniform order) so as to operate in constant time. Experimental results using a stereo SLAM system confirm that the proposed strategy, which uses loop closure constraints and robot trajectory hypotheses as a guide, achieves promising results despite a significant number of false-positive constraints and hypotheses.
RO Jun 24, 2015
Incremental RANSAC for Online Relocation in Large Dynamic Environments
Kanji Tanaka, Eiji Kondo
Vehicle relocation is the problem in which a mobile robot has to estimate its self-position with respect to an a priori map of landmarks using perception and motion measurements, without any knowledge of the initial self-position. Recently, RANdom SAmple Consensus (RANSAC), a robust multi-hypothesis estimator, has been successfully applied to offline relocation in static environments. On the other hand, online relocation in dynamic environments is still a difficult problem, because the available computation time is always limited and because the measurements include many outliers. To realize a real-time algorithm for such an online process, we have developed an incremental version of the RANSAC algorithm by extending an efficient preemption RANSAC scheme. This novel scheme, named incremental RANSAC, is able to find inlier hypotheses of self-positions out of a large number of outlier hypotheses contaminated by outlier measurements.
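The general idea of preemptive hypothesis scoring over a stream of measurements can be sketched as follows. This is not the authors' incremental RANSAC; it is a generic toy in which hypotheses are scored as measurements arrive and the weaker half is discarded after every block. The function name, the `block_size` parameter, and the 1-D world with a single landmark are all assumptions made for illustration.

```python
def incremental_preemptive_scoring(hypotheses, measurement_stream,
                                   is_inlier, block_size=2):
    """Score hypotheses incrementally and prune the weaker half per block."""
    # scores[i] counts the inlier measurements seen so far for hypotheses[i].
    scores = {i: 0 for i in range(len(hypotheses))}
    seen = 0
    for z in measurement_stream:
        # Only surviving hypotheses are evaluated on the new measurement.
        for i in scores:
            if is_inlier(hypotheses[i], z):
                scores[i] += 1
        seen += 1
        # After every block of measurements, keep the better-scoring half.
        if seen % block_size == 0 and len(scores) > 1:
            ranked = sorted(scores, key=scores.get, reverse=True)
            keep = ranked[: max(1, len(scores) // 2)]
            scores = {i: scores[i] for i in keep}
    best = max(scores, key=scores.get)
    return hypotheses[best], scores[best]


# Toy 1-D world (assumed): one landmark at x = 10, range measurements to it.
LANDMARK_X = 10.0


def range_is_inlier(position, measured_range, tol=0.5):
    """A measurement supports a position if the predicted range matches."""
    return abs((LANDMARK_X - position) - measured_range) < tol


candidate_positions = [0.0, 2.0, 5.0, 9.0]
ranges = [5.1, 4.9, 12.0, 5.0]  # 12.0 is an outlier measurement
print(incremental_preemptive_scoring(candidate_positions, ranges,
                                     range_is_inlier))
```

Pruning bounds the per-measurement cost, since the number of hypothesis evaluations shrinks geometrically as more measurements arrive. This is the property that makes preemptive schemes attractive when computation time is limited.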