Xueyun Chen

CV
3papers
6citations
Novelty52%
AI Score39

3 Papers

98.4ROJun 2
GeoAlign: Beyond Semantics with State-Guided Spatial Alignment in VLA Models

Yizhi Chen, Zhanxiang Cao, Xinyi Peng et al.

Current Vision--Language--Action (VLA) models often optimize for semantic grounding, whereas executable manipulation requires geometry-aware spatial alignment and dynamic affordance selection. We introduce GeoAlign, a state-guided spatial alignment architecture for VLA policy learning. GeoAlign post-trains an RGB geometry branch with robot-domain RGB-D supervision, yielding RGB-derived Geometry-Enhanced Post-Trained (GEP) features for policy rollout. The robot's proprioceptive state queries the GEP feature grid, producing compact, phase-dependent geometry tokens for action prediction. GeoAlign achieves 99.0% on LIBERO, 85.3% across three SimplerEnv-Fractal tasks, and 78.8% on eight geometry-critical real-world ALOHA tasks, with ablations confirming the value of geometry post-training and proprioceptive-state-guided querying.

CVAug 17, 2022
Road detection via a dual-task network based on cross-layer graph fusion modules

Zican Hu, Wurui Shi, Hongkun Liu et al.

Road detection based on remote sensing images is of great significance to intelligent traffic management. The performances of the mainstream road detection methods are mainly determined by their extracted features, whose richness and robustness can be enhanced by fusing features of different types and cross-layer connections. However, the features in the existing mainstream model frameworks are often similar in the same layer by the single-task training, and the traditional cross-layer fusion ways are too simple to obtain an efficient effect, so more complex fusion ways besides concatenation and addition deserve to be explored. Aiming at the above defects, we propose a dual-task network (DTnet) for road detection and cross-layer graph fusion module (CGM): the DTnet consists of two parallel branches for road area and edge detection, respectively, while enhancing the feature diversity by fusing features between two branches through our designed feature bridge modules (FBM). The CGM improves the cross-layer fusion effect by a complex feature stream graph, and four graph patterns are evaluated. Experimental results on three public datasets demonstrate that our method effectively improves the final detection result.

CVAug 17, 2022
IDAN: Image Difference Attention Network for Change Detection

Hongkun Liu, Zican Hu, Qichen Ding et al.

Remote sensing image change detection is of great importance in disaster assessment and urban planning. The mainstream method is to use encoder-decoder models to detect the change region of two input images. Since the change content of remote sensing images has the characteristics of wide scale range and variety, it is necessary to improve the detection accuracy of the network by increasing the attention mechanism, which commonly includes: Squeeze-and-Excitation block, Non-local and Convolutional Block Attention Module, among others. These methods consider the importance of different location features between channels or within channels, but fail to perceive the differences between input images. In this paper, we propose a novel image difference attention network (IDAN). In the image preprocessing stage, we use a pre-training model to extract the feature differences between two input images to obtain the feature difference map (FD-map), and Canny for edge detection to obtain the edge difference map (ED-map). In the image feature extracting stage, the FD-map and ED-map are input to the feature difference attention module and edge compensation module, respectively, to optimize the features extracted by IDAN. Finally, the change detection result is obtained through the feature difference operation. IDAN comprehensively considers the differences in regional and edge features of images and thus optimizes the extracted image features. The experimental results demonstrate that the F1-score of IDAN improves 1.62% and 1.98% compared to the baseline model on WHU dataset and LEVIR-CD dataset, respectively.