CVSep 14, 2022
Semantic Visual Simultaneous Localization and Mapping: A SurveyKaiqi Chen, Junhao Xiao, Jialing Liu et al.
Visual Simultaneous Localization and Mapping (vSLAM) has achieved great progress in the computer vision and robotics communities, and has been successfully used in many fields such as autonomous robot navigation and AR/VR. However, vSLAM cannot achieve good localization in dynamic and complex environments. Numerous publications have reported that, by combining with the semantic information with vSLAM, the semantic vSLAM systems have the capability of solving the above problems in recent years. Nevertheless, there is no comprehensive survey about semantic vSLAM. To fill the gap, this paper first reviews the development of semantic vSLAM, explicitly focusing on its strengths and differences. Secondly, we explore three main issues of semantic vSLAM: the extraction and association of semantic information, the application of semantic information, and the advantages of semantic vSLAM. Then, we collect and analyze the current state-of-the-art SLAM datasets which have been widely used in semantic vSLAM systems. Finally, we discuss future directions that will provide a blueprint for the future development of semantic vSLAM.
98.8CLMay 21Code
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent ReasoningYifan Dai, Zhenhua Wu, Bohan Zeng et al.
Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose \textbf{LatentOmni}, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct \textbf{LatentOmni-Instruct-35K}, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
CVNov 21, 2025
PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-AttentionYipeng Chen, Zhichao Ye, Zhenzhou Fang et al.
We propose PostCam, a framework for novel-view video generation that enables post-capture editing of camera trajectories in dynamic scenes. We find that existing video recapture methods suffer from suboptimal camera motion injection strategies; such suboptimal designs not only limit camera control precision but also result in generated videos that fail to preserve fine visual details from the source video. To achieve more accurate and flexible motion manipulation, PostCam introduces a query-shared cross-attention module. It integrates two distinct forms of control signals: the 6-DoF camera poses and the 2D rendered video frames. By fusing them into a unified representation within a shared feature space, our model can extract underlying motion cues, which enhances both control precision and generation quality. Furthermore, we adopt a two-stage training strategy: the model first learns coarse camera control from pose inputs, and then incorporates visual information to refine motion accuracy and enhance visual fidelity. Experiments on both real-world and synthetic datasets demonstrate that PostCam outperforms state-of-the-art methods by over 20% in camera control precision and view consistency, while achieving the highest video generation quality. Our project webpage is publicly available at: https://cccqaq.github.io/PostCam.github.io/
ROJun 23, 2021
Collaborative Visual Inertial SLAM for Multiple Smart PhonesJialing Liu, Ruyu Liu, Kaiqi Chen et al.
The efficiency and accuracy of mapping are crucial in a large scene and long-term AR applications. Multi-agent cooperative SLAM is the precondition of multi-user AR interaction. The cooperation of multiple smart phones has the potential to improve efficiency and robustness of task completion and can complete tasks that a single agent cannot do. However, it depends on robust communication, efficient location detection, robust mapping, and efficient information sharing among agents. We propose a multi-intelligence collaborative monocular visual-inertial SLAM deployed on multiple ios mobile devices with a centralized architecture. Each agent can independently explore the environment, run a visual-inertial odometry module online, and then send all the measurement information to a central server with higher computing resources. The server manages all the information received, detects overlapping areas, merges and optimizes the map, and shares information with the agents when needed. We have verified the performance of the system in public datasets and real environments. The accuracy of mapping and fusion of the proposed system is comparable to VINS-Mono which requires higher computing resources.
CVDec 21, 2020
Accurate Object Association and Pose Updating for Semantic SLAMKaiqi Chen, Jialing Liu, Qinying Chen et al.
Current pandemic has caused the medical system to operate under high load. To relieve it, robots with high autonomy can be used to effectively execute contactless operations in hospitals and reduce cross-infection between medical staff and patients. Although semantic Simultaneous Localization and Mapping (SLAM) technology can improve the autonomy of robots, semantic object association is still a problem that is worthy of being studied. The key to solving this problem is to correctly associate multiple object measurements of one object landmark by using semantic information, and to refine the pose of object landmark in real time. To this end, we propose a hierarchical object association strategy and a pose-refinement approach. The former one consists of two levels, i.e., a short-term object association and a global one. In the first level, we employ the multiple-object-tracking for short-term object association, through which the incorrect association among objects whose locations are close and appearances are similar can be avoided. Moreover, the short-term object association can provide more abundant object appearance and more robust estimation of object pose for the global object association in the second level. To refine the object pose in the map, we develop an approach to choose the optimal object pose from all object measurements associated with an object landmark. The proposed method is comprehensively evaluated on seven simulated hospital sequences1, a real hospital environment and the KITTI dataset. Experimental results show that our method has an obviously improvement in terms of robustness and accuracy for the object association and the trajectory estimation in the semantic SLAM.