CVDec 30, 2022Code
HPointLoc: Point-based Indoor Place Recognition using Synthetic RGB-D ImagesDmitry Yudin, Yaroslav Solomentsev, Ruslan Musaev et al.
We present a novel dataset named as HPointLoc, specially designed for exploring capabilities of visual place recognition in indoor environment and loop detection in simultaneous localization and mapping. The loop detection sub-task is especially relevant when a robot with an on-board RGB-D camera can drive past the same place (``Point") at different angles. The dataset is based on the popular Habitat simulator, in which it is possible to generate photorealistic indoor scenes using both own sensor data and open datasets, such as Matterport3D. To study the main stages of solving the place recognition problem on the HPointLoc dataset, we proposed a new modular approach named as PNTR. It first performs an image retrieval with the Patch-NetVLAD method, then extracts keypoints and matches them using R2D2, LoFTR or SuperPoint with SuperGlue, and finally performs a camera pose optimization step with TEASER++. Such a solution to the place recognition problem has not been previously studied in existing publications. The PNTR approach has shown the best quality metrics on the HPointLoc dataset and has a high potential for real use in localization systems for unmanned vehicles. The proposed dataset and framework are publicly available: https://github.com/metra4ok/HPointLoc.
RONov 7, 2023Code
Interactive Semantic Map Representation for Skill-based Visual Object NavigationTatiana Zemskova, Aleksei Staroverov, Kirill Muravyev et al.
Visual object navigation using learning methods is one of the key tasks in mobile robotics. This paper introduces a new representation of a scene semantic map formed during the embodied agent interaction with the indoor environment. It is based on a neural network method that adjusts the weights of the segmentation model with backpropagation of the predicted fusion loss values during inference on a regular (backward) or delayed (forward) image sequence. We have implemented this representation into a full-fledged navigation approach called SkillTron, which can select robot skills from end-to-end policies based on reinforcement learning and classic map-based planning methods. The proposed approach makes it possible to form both intermediate goals for robot exploration and the final goal for object navigation. We conducted intensive experiments with the proposed approach in the Habitat environment, which showed a significant superiority in navigation quality metrics compared to state-of-the-art approaches. The developed code and used custom datasets are publicly available at github.com/AIRI-Institute/skill-fusion.
CVOct 18, 2023Code
SegmATRon: Embodied Adaptive Semantic Segmentation for Indoor EnvironmentTatiana Zemskova, Margarita Kichik, Dmitry Yudin et al.
This paper presents an adaptive transformer model named SegmATRon for embodied image semantic segmentation. Its distinctive feature is the adaptation of model weights during inference on several images using a hybrid multicomponent loss function. We studied this model on datasets collected in the photorealistic Habitat and the synthetic AI2-THOR Simulators. We showed that obtaining additional images using the agent's actions in an indoor environment can improve the quality of semantic segmentation. The code of the proposed approach and datasets are publicly available at https://github.com/wingrune/SegmATRon.
66.0ROMar 30Code
OVSegDT: Segmenting Transformer for Open-Vocabulary Object Goal NavigationTatiana Zemskova, Aleksei Staroverov, Dmitry Yudin et al.
Open-vocabulary Object Goal Navigation requires an embodied agent to reach objects described by free-form language, including categories never seen during training. Existing end-to-end policies overfit small simulator datasets, achieving high success on training scenes but failing to generalize and exhibiting unsafe behaviour (frequent collisions). We introduce OVSegDT, a lightweight transformer policy that tackles these issues with two synergistic components. The first component is the semantic branch, which includes an encoder for the target binary mask and an auxiliary segmentation loss function, grounding the textual goal and providing precise spatial cues. The second component consists of a proposed Entropy-Adaptive Loss Modulation, a per-sample scheduler that continuously balances imitation and reinforcement signals according to the policy entropy, eliminating brittle manual phase switches. These additions cut the sample complexity of training by 33%, and reduce collision count in two times while keeping inference cost low (130M parameters, RGB-only input). On HM3D-OVON, our model matches the performance on unseen categories to that on seen ones and establishes state-of-the-art results (40.1% SR, 20.9% SPL on val unseen) without depth, odometry, or large vision-language models. Code is available at https://github.com/CognitiveAISystems/OVSegDT.
53.6ROMar 24Code
Dynamic Neural Potential Field: Online Trajectory Optimization in the Presence of Moving ObstaclesAleksei Staroverov, Muhammad Alhaddad, Aditya Narendra et al.
Generalist robot policies must operate safely and reliably in everyday human environments such as homes, offices, and warehouses, where people and objects move unpredictably. We present Dynamic Neural Potential Field (NPField-GPT), a learning-enhanced model predictive control (MPC) framework that couples classical optimization with a Transformer-based predictor of footprint-aware repulsive potentials. Given an occupancy sub-map, robot footprint, and optional dynamic-obstacle cues, our NPField-GPT model forecasts a horizon of differentiable potentials that are injected into a sequential quadratic MPC program via L4CasADi, yielding real-time, constraint-aware trajectory optimization. We additionally study two baselines: NPField-StaticMLP, where a dynamic scene is treated as a sequence of static maps; and NPField-DynamicMLP, which predicts the future potential sequence in parallel with an MLP. In dynamic indoor scenarios from BenchMR and on a Husky UGV in office corridors, NPField-GPT produces more efficient and safer trajectories under motion changes, while StaticMLP/DynamicMLP offer lower latency. We also compare with the CIAO* and MPPI baselines. Across methods, the Transformer+MPC synergy preserves the transparency and stability of model-based planning while learning only the part that benefits from data: spatiotemporal collision risk. Code and trained models are available at https://github.com/CognitiveAISystems/Dynamic-Neural-Potential-Field