Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time Visual Scene Understanding
This work addresses the need for versatile semantic understanding in indoor mobile and assistive autonomous systems. It is incremental in that it builds on existing SLAM and LLM methods.
The authors tackle the limited adaptability of indoor autonomous systems by introducing LEXIS, a real-time SLAM system that uses LLMs for open-vocabulary scene understanding. It achieves state-of-the-art performance in room categorization and performance equivalent to the state of the art in place recognition and trajectory estimation.
Versatile and adaptive semantic understanding would enable autonomous systems to comprehend and interact with their surroundings. Existing fixed-class models limit the adaptability of indoor mobile and assistive autonomous systems. In this work, we introduce LEXIS, a real-time indoor Simultaneous Localization and Mapping (SLAM) system that harnesses the open-vocabulary nature of Large Language Models (LLMs) to create a unified approach to scene understanding and place recognition. The approach first builds a topological SLAM graph of the environment (using visual-inertial odometry) and embeds Contrastive Language-Image Pretraining (CLIP) features in the graph nodes. We use this representation for flexible room classification and segmentation, which serves as a basis for room-centric place recognition. This allows loop closure searches to be directed towards semantically relevant places. Our proposed system is evaluated using both public simulated data and real-world data, covering office and home environments. It successfully categorizes rooms with varying layouts and dimensions and outperforms the state of the art (SOTA). For place recognition and trajectory estimation tasks, we achieve performance equivalent to the SOTA while using the same pre-trained model for all tasks. Lastly, we demonstrate the system's potential for planning.
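To make the room-centric idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each graph node stores an image embedding and each candidate room label has a text embedding in the same space, as CLIP provides. Toy 3-D vectors stand in for real CLIP features, and the names `Node`, `classify_room`, `loop_closure_candidates`, and the threshold `min_sim` are hypothetical. A node is labelled with the most similar room name by cosine similarity, and loop closure candidates are then searched only among nodes that share the query's room label.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


@dataclass
class Node:
    """One graph node holding an image embedding (stand-in for a CLIP feature)."""
    node_id: int
    embedding: Vector
    room: Optional[str] = None


def classify_room(embedding: Vector, label_embeddings: Dict[str, Vector]) -> str:
    """Return the room label whose text embedding is most similar to the image embedding."""
    return max(label_embeddings, key=lambda name: cosine(embedding, label_embeddings[name]))


def loop_closure_candidates(
    query: Node, nodes: List[Node], min_sim: float = 0.8
) -> List[Tuple[int, float]]:
    """Search only nodes sharing the query's room label; return (id, similarity), best first."""
    scored = [
        (n.node_id, cosine(query.embedding, n.embedding))
        for n in nodes
        if n.node_id != query.node_id and n.room == query.room
    ]
    kept = [s for s in scored if s[1] >= min_sim]  # min_sim is an assumed threshold
    return sorted(kept, key=lambda s: s[1], reverse=True)


# Toy vectors standing in for real text and image embeddings (assumption).
labels = {"kitchen": [1.0, 0.0, 0.0], "office": [0.0, 1.0, 0.0]}
nodes = [
    Node(0, [0.9, 0.1, 0.0]),
    Node(1, [0.1, 0.9, 0.1]),
    Node(2, [0.8, 0.2, 0.1]),
]
for n in nodes:
    n.room = classify_room(n.embedding, labels)

query = Node(3, [0.85, 0.15, 0.05])
query.room = classify_room(query.embedding, labels)
candidates = loop_closure_candidates(query, nodes)
```

Because the label set is just a dictionary of text embeddings, adding a new room type in this sketch means adding one entry rather than retraining a fixed-class model. Filtering by room label before scoring is what narrows the loop closure search to semantically relevant places.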