CV | May 20 | Code
ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
Qirui Shen, Wenda Wang, Jiachen Lu et al.
Architectural spatial intelligence, the ability to recognize and reason about architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Extensive research has evaluated basic spatial skills of Vision-Language Models (VLMs), such as relative orientation, distance comparison, and object counting. However, these tasks cover only the most elementary levels of spatial cognition. They largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a benchmark for architectural spatial intelligence grounded in perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions (perception, reasoning, navigation, transformation, and configuration), comprising 17 fine-grained subtasks. Experts with architectural backgrounds carefully annotated 3,000 question-answer pairs by hand to enable comprehensive evaluation of architectural spatial intelligence. We evaluate a range of VLMs on ArchSIBench and find that the architectural spatial intelligence of most models differs significantly from human baselines and varies substantially across capability dimensions. Some state-of-the-art models approach the level of human evaluators without architectural training. However, a clear gap remains relative to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. We believe that ArchSIBench will provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at https://huggingface.co/datasets/ArchSIBench/ArchSIBench.
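The abstract reports results per capability dimension rather than as one pooled score. A minimal Python sketch of that kind of per-dimension scoring follows. It assumes multiple-choice answers scored by exact match. The `QAItem` record layout and the `normalize` rule are hypothetical and are not taken from the released dataset or its evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five core dimensions named in the abstract.
DIMENSIONS = ("perception", "reasoning", "navigation", "transformation", "configuration")


@dataclass(frozen=True)
class QAItem:
    # Hypothetical record layout; the released dataset's field names may differ.
    qid: str
    dimension: str
    subtask: str
    answer: str  # assumed to be a multiple-choice option letter such as "A"


def normalize(reply: str) -> str:
    """Reduce a reply like ' (b) ' or 'B) turn left' to a bare upper-case option letter."""
    cleaned = reply.strip().strip("().").upper()
    return cleaned[:1]


def per_dimension_accuracy(items, predictions):
    """Exact-match accuracy per dimension; a missing prediction counts as wrong."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.dimension] += 1
        pred = predictions.get(item.qid)
        if pred is not None and normalize(pred) == normalize(item.answer):
            correct[item.dimension] += 1
    # Only report dimensions that actually have questions.
    return {dim: correct[dim] / total[dim] for dim in DIMENSIONS if total[dim]}
```

Keeping one accuracy per dimension is what makes the cross-dimension variability described above visible. A single pooled accuracy would hide a weak dimension behind a strong one.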
CV | Jun 15, 2023 | Code
Revisiting Stereo Triangulation in UAV Distance Estimation
Jiafan Zhuang, Duan Yuan, Rihong Yan et al.
Distance estimation plays an important role in path planning and collision avoidance for UAV swarms. However, the lack of annotated data seriously hinders related research. In this work, we build and present UAVDE, a dataset for UAV distance estimation in which the distance between two UAVs is measured by UWB (ultra-wideband) sensors. During experiments, we were surprised to observe that stereo triangulation does not hold up in UAV scenes. The core reason is position deviation caused by long shooting distances and camera vibration, both of which are common in UAV scenes. To tackle this issue, we propose a novel position correction module (PCM). It directly predicts the offset between the observed positions and the actual ones and then compensates for this offset in the stereo triangulation calculation. To further boost performance on hard samples, we propose a dynamic iterative correction mechanism. It consists of multiple stacked PCMs and a gating mechanism that adaptively decides, according to the difficulty of each sample, whether further correction is required. Extensive experiments on UAVDE show that our method achieves a significant improvement over a strong baseline, reducing the relative difference from 49.4% to 9.8%, which demonstrates its effectiveness. The code and dataset are available at https://github.com/duanyuan13/PCM.
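The correction idea is easier to see against the textbook rectified-stereo relation Z = f * B / d, where d = x_left - x_right. A few pixels of positional error in d move the estimate a long way at long range. The Python sketch below makes several assumptions:

- The offset predictor and the gate are plain callables (`predict_offset`, `needs_more`), hypothetical stand-ins for the paper's learned PCM and gating network.
- `relative_difference` is taken to be the usual |pred - truth| / truth. The paper's exact definition may differ.

```python
from typing import Callable, Tuple

# Hypothetical stand-ins for the learned modules (the PCM and its gate are networks in the paper).
OffsetFn = Callable[[float, float], Tuple[float, float]]
GateFn = Callable[[float, float], bool]


def triangulate_depth(x_left: float, x_right: float, focal_px: float, baseline_m: float) -> float:
    """Rectified stereo: Z = f * B / d, with disparity d = x_left - x_right (pixels)."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return focal_px * baseline_m / disparity


def corrected_depth(x_left: float, x_right: float, focal_px: float, baseline_m: float,
                    predict_offset: OffsetFn, needs_more: GateFn, max_steps: int = 3) -> float:
    """Apply up to max_steps offset corrections, stopping early once the gate says 'done'."""
    for _ in range(max_steps):
        d_left, d_right = predict_offset(x_left, x_right)
        # Compensate the observed image positions before triangulating.
        x_left, x_right = x_left + d_left, x_right + d_right
        if not needs_more(x_left, x_right):
            break
    return triangulate_depth(x_left, x_right, focal_px, baseline_m)


def relative_difference(pred: float, truth: float) -> float:
    """Assumed metric: absolute error divided by the ground-truth distance."""
    return abs(pred - truth) / truth
```

Take f = 1000 px and B = 0.5 m. A true disparity of 10 px gives 50 m. If vibration shifts one observation by 2 px (disparity 8 px), the estimate becomes 62.5 m, a relative difference of 0.25. An offset predictor that returns the missing 2 px restores 50 m.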
CV | May 20
Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches
Wenda Wang, Anqi Liu, Junqi Yang et al.
Converting hand-drawn sketches into structured 3D geometry remains challenging because non-Euclidean surfaces are hard to represent and topological consistency is hard to maintain. Existing generative models such as GANs, NeRFs, and diffusion models often fail to produce editable manifolds that can be used directly in downstream design workflows. We present Sketch2MinSurf, a hybrid framework that integrates vision-language guidance with minimal-surface theory and geometric optimization to generate smooth, editable 3D surfaces from hand-drawn sketches. The core of our approach is a spatial-topological encoding that represents geometry as tuples of node coordinates together with skeletons of real and virtual edges, enabling stable topological control during generation. We further introduce the Sketch2MinSurf Structural Loss (S2MS-Loss), a reward-modulated objective that jointly constrains geometric reconstruction and topological coherence. On a test set of 100 sketches, Sketch2MinSurf achieves a topological similarity score of 0.844, outperforming existing sketch-to-shape baselines. The generated manifolds are directly editable and free of non-manifold artifacts. A public art installation at a university showcases the method's potential for 3D form generation driven by human intent. The dataset and code are available at https://anonymous.4open.science/r/Sketch2MinSurf/.
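The encoding described above is a graph with coordinates attached to nodes and a real/virtual flag attached to edges. The Python sketch below shows one way such a structure might be held, along with the kind of consistency check that "stable topological control" implies. The container, the field names, and the rule that a real and a virtual edge between the same pair of nodes count as a duplicate are all hypothetical. Only a boolean flag is modeled; the sketch does not assume what a virtual edge means geometrically in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point3 = Tuple[float, float, float]


@dataclass
class SkeletonEncoding:
    """Hypothetical container: node coordinates plus edges flagged as real or virtual."""
    nodes: List[Point3]
    # Each edge is (i, j, is_real), indexing into `nodes`.
    edges: List[Tuple[int, int, bool]] = field(default_factory=list)

    def validate(self) -> List[str]:
        """List topological problems; an empty list means the encoding is well formed."""
        problems: List[str] = []
        seen = set()
        n = len(self.nodes)
        for i, j, _ in self.edges:
            if not (0 <= i < n and 0 <= j < n):
                problems.append(f"edge ({i}, {j}) references a missing node")
                continue
            if i == j:
                problems.append(f"self-loop at node {i}")
                continue
            key = (min(i, j), max(i, j))  # undirected: (1, 0) is the same edge as (0, 1)
            if key in seen:
                problems.append(f"duplicate edge {key}")
            seen.add(key)
        return problems

    def adjacency(self, real_only: bool = True) -> Dict[int, List[int]]:
        """Undirected adjacency lists (call validate() first); optionally real edges only."""
        adj: Dict[int, List[int]] = {k: [] for k in range(len(self.nodes))}
        for i, j, is_real in self.edges:
            if real_only and not is_real:
                continue
            adj[i].append(j)
            adj[j].append(i)
        return adj
```

Storing geometry and topology separately like this lets coordinates be edited without touching connectivity. That is one plausible reading of why such an encoding keeps generated surfaces editable.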
CV | Dec 11, 2020
Laser Data Based Automatic Generation of Lane-Level Road Map for Intelligent Vehicles
Zehai Yu, Hui Zhu, Linglong Lin et al.
With the development of intelligent vehicle systems, high-precision road maps are increasingly needed for many applications. Automatic lane-line extraction and modeling are the most essential steps in generating a precise lane-level road map. In this paper, we propose an automatic lane-level road map generation system. To extract road markings on the ground, we apply a multi-region Otsu thresholding method, which finds the laser-intensity threshold that maximizes the between-class variance of background and road-marking points. The extracted road-marking points are then projected onto a raster image and clustered with a two-stage clustering algorithm. Lane lines are subsequently recognized from these clusters by the shape features of their minimum bounding rectangles. To store the map efficiently, the lane lines are approximated by cubic polynomial curves using a Bayesian estimation approach. The proposed system has been tested under urban and expressway conditions in Hefei, China. Experimental results on the datasets show that our method achieves excellent extraction and clustering results, and the fitted lines reach high positional accuracy, with errors of less than 10 cm.
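Otsu's method picks the threshold that maximizes the between-class variance w0 * w1 * (mu0 - mu1)^2 of the two classes it creates. The Python sketch below applies it separately to each region of laser points. The region scheme is a hypothetical choice: fixed-length bins along the x (driving) axis with `region_len`. The paper's actual partition is not described in the abstract. The rule that skips regions with constant intensity is also an assumption.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x along-track, y cross-track, laser intensity)


def otsu_threshold(values: Sequence[float], bins: int = 256) -> float:
    """Otsu's method: choose the histogram split that maximizes between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return hi
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (k + 0.5) * width for k in range(bins)]
    total = len(values)
    sum_all = sum(h * c for h, c in zip(hist, centers))
    w0, sum0 = 0, 0.0
    best_score, best_k = -1.0, 0
    for k in range(bins - 1):  # candidate split between bin k and bin k + 1
        w0 += hist[k]
        sum0 += hist[k] * centers[k]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        # Equals the normalized between-class variance times total**2, so the argmax is the same.
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_score, best_k = score, k
    return lo + (best_k + 1) * width  # upper edge of the last background bin


def extract_markings(points: Sequence[Point], region_len: float = 10.0) -> List[Point]:
    """Threshold intensity separately within each along-track region (hypothetical scheme)."""
    regions: Dict[int, List[Point]] = {}
    for p in points:
        regions.setdefault(int(p[0] // region_len), []).append(p)
    markings: List[Point] = []
    for pts in regions.values():
        intensities = [p[2] for p in pts]
        if max(intensities) == min(intensities):
            continue  # no contrast in this region: treat it as background only
        t = otsu_threshold(intensities)
        markings.extend(p for p in pts if p[2] >= t)
    return markings
```

Thresholding per region rather than once globally lets the split follow local changes in pavement reflectance along the road. One caveat applies to plain Otsu in general: a region containing no markings at all will still be split in two.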
CV | Nov 26, 2020
A Fast Point Cloud Ground Segmentation Approach Based on Coarse-To-Fine Markov Random Field
Weixin Huang, Huawei Liang, Linglong Lin et al.
Ground segmentation is an important preprocessing task for autonomous vehicles (AVs) equipped with 3D LiDARs. Existing ground segmentation methods struggle to balance accuracy against computational complexity. To address this, we propose a fast point cloud ground segmentation approach based on a coarse-to-fine Markov random field (MRF). The method uses an improved elevation map for coarse ground segmentation and then uses spatiotemporally adjacent points to refine the segmentation results. The processed point cloud is classified into high-confidence obstacle points, ground points, and points of unknown class, which are used to initialize an MRF model. The graph cut method is then used to solve the model and achieve fine segmentation. Experiments on datasets show that our method achieves higher ground segmentation accuracy than other algorithms and is faster than other graph-based algorithms. It requires only a single core of an Intel i7-3770 CPU to process a frame of Velodyne HDL-64E data, taking 39.77 ms on average. Field tests were also conducted to demonstrate the effectiveness of the proposed method.
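The coarse stage produces three labels (ground, obstacle, unknown), and only the unknown points are left for the MRF and graph cut to resolve. The Python sketch below produces such a three-way labeling from a simple grid elevation map. It is not the paper's "improved elevation map". It assumes the lowest point in each cell lies on the ground. The cell size and the thresholds `flat_max` and `obstacle_min` are hypothetical values. The spatiotemporal refinement and the graph-cut stage are not reproduced.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

GROUND, OBSTACLE, UNKNOWN = 0, 1, 2


def coarse_labels(points: Sequence[Tuple[float, float, float]], cell: float = 0.5,
                  flat_max: float = 0.15, obstacle_min: float = 0.4) -> List[int]:
    """Label each (x, y, z) point GROUND, OBSTACLE, or UNKNOWN from per-cell height statistics."""
    cells: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for idx, (x, y, _) in enumerate(points):
        cells[(int(x // cell), int(y // cell))].append(idx)
    labels = [UNKNOWN] * len(points)
    for members in cells.values():
        zs = [points[i][2] for i in members]
        z_min = min(zs)
        spread = max(zs) - z_min
        for i in members:
            height = points[i][2] - z_min  # height above the lowest point in the cell
            if spread < flat_max:
                labels[i] = GROUND    # the whole cell is nearly flat
            elif height > obstacle_min:
                labels[i] = OBSTACLE  # clearly above the local ground estimate
            # otherwise stays UNKNOWN and is left for the fine (MRF + graph cut) stage
    return labels
```

Grid binning is a single linear pass over the points, and only the ambiguous points reach the more expensive graph-based solver. That division of work is consistent with the speed the abstract reports.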