CV · Feb 25, 2023 · Code
SUPS: A Simulated Underground Parking Scenario Dataset for Autonomous Driving
Jiawei Hou, Qi Chen, Yurong Cheng et al.
Automatic underground parking has attracted considerable attention as the scope of autonomous driving expands. The autonomous vehicle is expected to obtain environmental information, track its location, and build a reliable map of the scenario. Mainstream solutions consist of well-trained neural networks and simultaneous localization and mapping (SLAM) methods, which need numerous carefully labeled images and estimates from multiple sensors. However, there is a lack of underground parking scenario datasets with multiple sensors and well-labeled images that support both SLAM tasks and perception tasks, such as semantic segmentation and parking slot detection. In this paper, we present SUPS, a simulated dataset for underground automatic parking, which supports multiple tasks with multiple sensors and multiple semantic labels aligned with successive images according to timestamps. We intend to address the shortcomings of existing datasets through the variability of environments and the diversity and accessibility of sensors in the virtual scene. Specifically, the dataset records frames from four surrounding fisheye cameras, two forward pinhole cameras, and a depth camera, as well as data from LiDAR, an inertial measurement unit (IMU), and GNSS. Pixel-level semantic labels are provided for objects, especially ground signs such as arrows, parking lines, lanes, and speed bumps. Perception, 3D reconstruction, depth estimation, SLAM, and other related tasks are supported by our dataset. We also evaluate state-of-the-art SLAM algorithms and perception models on our dataset. Finally, we open-source our virtual 3D scene, built with the Unity Engine, and release our dataset at https://github.com/jarvishou829/SUPS.
91.7 · SD · May 29
MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors
Guangyin Bao, Taiping Zeng, Jianfeng Feng et al.
Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, resulting in spectrally similar but unintelligible results. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising direction for auditory neuroscience research and non-invasive speech brain-computer interfaces.
LG · Jun 2, 2025 · Code
Latent Structured Hopfield Network for Semantic Association and Retrieval
Chong Li, Xiangyang Xue, Jianfeng Feng et al.
Episodic memory enables humans to recall past experiences by associating semantic elements such as objects, locations, and time into coherent event representations. While large pretrained models have shown remarkable progress in modeling semantic memory, the mechanisms for forming associative structures that support episodic memory remain underexplored. Inspired by hippocampal CA3 dynamics and its role in associative memory, we propose the Latent Structured Hopfield Network (LSHN), a biologically inspired framework that integrates continuous Hopfield attractor dynamics into an autoencoder architecture. LSHN mimics the cortical-hippocampal pathway: a semantic encoder extracts compact latent representations, a latent Hopfield network performs associative refinement through attractor convergence, and a decoder reconstructs perceptual input. Unlike traditional Hopfield networks, our model is trained end-to-end with gradient descent, achieving scalable and robust memory retrieval. Experiments on MNIST, CIFAR-10, and a simulated episodic memory task demonstrate superior performance in recalling corrupted inputs under occlusion and noise, outperforming existing associative memory models. Our work provides a computational perspective on how semantic elements can be dynamically bound into episodic memory traces through biologically grounded attractor mechanisms. Code: https://github.com/fudan-birlab/LSHN.
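The attractor-based retrieval named in the abstract above can be illustrated with a minimal sketch of a continuous Hopfield network recovering a corrupted pattern. This is not the LSHN implementation: there is no encoder or decoder, the weights come from a plain Hebbian rule rather than end-to-end gradient training, and the function names, the `beta` gain, and the choice of orthogonal Hadamard rows as stored patterns are all assumptions made for illustration.

```python
import numpy as np

def hadamard(n_doublings):
    """Sylvester construction: returns a 2**n_doublings square matrix of +/-1 with orthogonal rows."""
    h = np.array([[1.0]])
    base = np.array([[1.0, 1.0], [1.0, -1.0]])
    for _ in range(n_doublings):
        h = np.kron(h, base)
    return h

def hebbian_weights(patterns):
    """Outer-product (Hebbian) weights storing each row of `patterns` (assumed rule, not learned)."""
    n_units = patterns.shape[1]
    return patterns.T @ patterns / n_units

def retrieve(weights, state, beta=4.0, steps=10):
    """Iterate the continuous update x <- tanh(beta * W x) so the state settles into an attractor."""
    x = state.astype(float)
    for _ in range(steps):
        x = np.tanh(beta * weights @ x)
    return x

# Store three orthogonal 64-dimensional patterns (hypothetical stand-ins for latent codes).
patterns = hadamard(6)[1:4]
weights = hebbian_weights(patterns)

# Corrupt the first pattern by flipping six of its entries, then let the dynamics clean it up.
corrupted = patterns[0].copy()
corrupted[:6] *= -1
recovered = np.sign(retrieve(weights, corrupted))
```

Orthogonal patterns are used here only so that retrieval is free of crosstalk between stored items; with random or learned codes, capacity and noise tolerance would depend on the load and on how the weights are obtained.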
RO · May 30, 2025
Hi-Dyna Graph: Hierarchical Dynamic Scene Graph for Robotic Autonomy in Human-Centric Environments
Jiawei Hou, Xiangyang Xue, Taiping Zeng
Autonomous operation of service robots in human-centric scenes remains challenging because it requires understanding changing environments and making context-aware decisions. While existing approaches like topological maps offer efficient spatial priors, they fail to model transient object relationships, whereas dense neural representations (e.g., NeRF) incur prohibitive computational costs. Inspired by work on hierarchical scene representation and video scene graph generation, we propose Hi-Dyna Graph, a hierarchical dynamic scene graph architecture that integrates persistent global layouts with localized dynamic semantics for embodied robotic autonomy. Our framework constructs a global topological graph from posed RGB-D inputs, encoding room-scale connectivity and large static objects (e.g., furniture), while environmental and egocentric cameras populate dynamic subgraphs with object position relations and human-object interaction patterns. A hybrid architecture is constructed by anchoring these subgraphs to the global topology using semantic and spatial constraints, enabling seamless updates as the environment evolves. An agent powered by large language models (LLMs) is employed to interpret the unified graph, infer latent task triggers, and generate executable instructions grounded in robotic affordances. We conduct complex experiments to demonstrate the effectiveness of Hi-Dyna Graph's scene representation. Real-world deployments with a mobile manipulator validate the system's practicality: acting as a cafeteria assistant in a dynamic scene, the robot autonomously completes complex tasks with no further training or complex reward design. See https://anonymous.4open.science/r/Hi-Dyna-Graph-B326 for video demonstration and more details.
CV · Nov 24, 2025
DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video
Jiawei Hou, Shenghao Zhang, Can Wang et al.
Reliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets that capture continuous reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundational models and designs a geometry-aware spatiotemporal decoder to effectively capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, effectively addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.
RO · Mar 6, 2020
StereoNeuroBayesSLAM: A Neurobiologically Inspired Stereo Visual SLAM System Based on Direct Sparse Method
Taiping Zeng, Xiaoli Li, Bailu Si
We propose a neurobiologically inspired visual simultaneous localization and mapping (SLAM) system based on the direct sparse method to build cognitive maps of large-scale environments in real time from a moving stereo camera. The core SLAM system mainly comprises a Bayesian attractor network, which utilizes neural responses of head direction (HD) cells in the hippocampus and grid cells in the medial entorhinal cortex (MEC) to represent the head direction and the position of the robot in the environment, respectively. The direct sparse method is employed to accurately and robustly estimate velocity information from a stereo camera. Input rotational and translational velocities are integrated by the HD cell and grid cell networks, respectively. We demonstrate our neurobiologically inspired stereo visual SLAM system on the KITTI odometry benchmark datasets. Our proposed SLAM system robustly builds a coherent semi-metric topological map in real time from a stereo camera. Qualitative evaluation of the cognitive maps shows that our proposed neurobiologically inspired stereo visual SLAM system outperforms our previous brain-inspired algorithms and the neurobiologically inspired monocular visual SLAM system in both tracking accuracy and robustness, and that its performance is closer to that of traditional state-of-the-art systems.
NC · Oct 10, 2019
Learning Sparse Spatial Codes for Cognitive Mapping Inspired by Entorhinal-Hippocampal Neurocircuit
Taiping Zeng, XiaoLi Li, Bailu Si
The entorhinal-hippocampal circuit plays a critical role in higher brain functions, especially spatial cognition. Grid cells in the medial entorhinal cortex (MEC) fire periodically with different grid spacings and orientations, which helps place cells in the hippocampus uniquely encode locations in an environment. But how sparsely firing granule cells in the dentate gyrus are formed from grid cells in the MEC remains to be determined. Recently, the fruit fly olfactory circuit has been found to provide a variant algorithm (called locality-sensitive hashing) to solve this problem. To investigate how sparse place firing generated in the dentate gyrus can help animals resolve perceptual ambiguity during environment exploration, we build a biologically relevant computational model from grid cells to place cells. The weights from grid cells to dentate gyrus granule cells are learned by competitive Hebbian learning. We use a robot system to demonstrate our cognitive mapping model on the KITTI odometry benchmark dataset. The experimental results show that our model is able to stably and robustly build a coherent semi-metric topological map in a large-scale outdoor environment. The experimental results suggest that the entorhinal-hippocampal circuit, acting as a variant locality-sensitive hashing algorithm, is capable of generating sparse encodings that make different locations in the environment easy to distinguish. Our experiments also provide theoretical support for the idea that this analogous hashing algorithm may be a general principle of computation in different brain regions and species.
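The locality-sensitive hashing idea that the abstract above borrows from the fruit fly olfactory circuit can be sketched as a sparse random expansion followed by winner-take-all. This is the generic fly-style hash, not the paper's model: the projection here is random rather than learned by competitive Hebbian learning, and the function names, layer sizes, `fan_in`, and `k` are illustrative assumptions.

```python
import numpy as np

def make_projection(n_inputs, n_outputs, fan_in, rng):
    """Sparse binary projection: each output unit samples `fan_in` distinct input units."""
    proj = np.zeros((n_outputs, n_inputs))
    for row in proj:  # each row is a view, so the assignment writes into proj
        row[rng.choice(n_inputs, size=fan_in, replace=False)] = 1.0
    return proj

def sparse_code(x, proj, k):
    """Expand the input, then keep only the k most active output units (winner-take-all)."""
    activity = proj @ x
    winners = np.argpartition(activity, -k)[-k:]
    code = np.zeros(proj.shape[0], dtype=int)
    code[winners] = 1
    return code

rng = np.random.default_rng(0)
grid_activity = rng.random(50)           # hypothetical stand-in for grid-cell firing rates
proj = make_projection(50, 400, 6, rng)  # expand 50 inputs to 400 granule-like units
code = sparse_code(grid_activity, proj, k=20)
```

The resulting code is a 400-dimensional binary vector with 20 active units, which is the kind of sparse tag that such a hash assigns to an input.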
RO · Oct 9, 2019
A Brain-Inspired Compact Cognitive Mapping System
Taiping Zeng, Bailu Si
As the robot explores the environment, the map in a simultaneous localization and mapping (SLAM) system grows over time, especially in large-scale environments. The ever-growing map prevents long-term mapping. In this paper, we develop a compact cognitive mapping approach inspired by neurobiological experiments. Inspired by neighborhood cells, we propose neighborhood fields, determined by movement information (i.e., translation and rotation), to describe distinct segments of the explored environment. Vertices and edges whose movement information falls below the threshold of the neighborhood fields are not added to the cognitive map. The optimization of the cognitive map is formulated as a robust non-linear least squares problem, which can be efficiently solved by the fast open linear solvers as a general problem. Following the cognitive decision-making used in familiar environments, loop closure edges are clustered according to time intervals, and parallel computing is then applied to perform batch global optimization of the cognitive map, ensuring computational efficiency and real-time performance. After the loop closure process, scene integration is performed, in which revisited vertices are removed to further reduce the size of the cognitive map. A monocular visual SLAM system is developed to test our approach in a rat-like maze environment. Our results suggest that the method largely restricts the growth of the cognitive map over time, while the compact cognitive map represents the overall layout of the environment as correctly as the standard one. Experiments demonstrate that our method is well suited to compact cognitive mapping in support of long-term robot mapping. Our approach is simple, yet pragmatic and efficient for building a compact cognitive map.
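The thresholding step in the abstract above, where vertices with too little movement are left out of the map, can be illustrated with a small sketch. It assumes 2D poses `(x, y, theta)` and compares each new pose only with the most recently kept vertex; the function names and threshold values are hypothetical, and the paper's neighborhood fields may be defined differently.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    return abs((a - b + math.pi) % (2 * math.pi) - math.pi)

def compact_vertices(poses, min_translation, min_rotation):
    """Keep a pose only if it moved or turned enough relative to the last kept vertex."""
    kept = [poses[0]]
    for x, y, theta in poses[1:]:
        kx, ky, ktheta = kept[-1]
        moved = math.hypot(x - kx, y - ky)
        turned = angle_diff(theta, ktheta)
        if moved >= min_translation or turned >= min_rotation:
            kept.append((x, y, theta))
    return kept

# Hypothetical trajectory: small forward steps, one larger step, then a turn in place.
poses = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (0.2, 0.0, 0.0),
    (0.6, 0.0, 0.0),
    (0.6, 0.0, 1.0),
    (0.7, 0.0, 1.0),
]
kept = compact_vertices(poses, min_translation=0.5, min_rotation=0.5)
```

With these thresholds, three of the six poses are kept: the start, the pose after the larger forward step, and the pose after the turn.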