Yuqi Xiong

AI
h-index4
3papers
4citations
Novelty53%
AI Score43

3 Papers

AIJan 29Code
Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains

Yuqi Xiong, Chunyi Peng, Zhipeng Xu et al.

Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image-based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self-emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two-stage Reinforcement Learning (RL)-based training framework. Specifically, the first stage optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox, and the second stage further optimizes VLMs to exploit these linguistic tools for downstream reasoning effectively. Experimental results demonstrate the effectiveness of Lang2Act in substantially enhancing the visual perception capabilities of VLMs, achieving performance improvements of over 4%. All code and data are available at https://github.com/NEUIR/Lang2Act.

CVAug 28, 2025Code
Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection

Yuqi Xiong, Wuzhen Shi, Yang Wen et al.

In view of the problems that existing salient object detection (SOD) methods are prone to losing details, blurring edges, and insufficient fusion of single-modal information in complex scenes, this paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet). Firstly, a dynamic uncertainty graph convolution module (DUGC) is designed to propagate uncertainty between layers through a sparse graph constructed based on spatial semantic distance, and combined with channel adaptive interaction, it effectively improves the detection accuracy of small structures and edge regions. Secondly, a multimodal collaborative fusion strategy (MCF) is proposed, which uses learnable modality gating weights to weightedly fuse the attention maps of RGB, depth, and edge features. It can dynamically adjust the importance of each modality according to different scenes, effectively suppress redundant or interfering information, and strengthen the semantic complementarity and consistency between cross-modalities, thereby improving the ability to identify salient regions under occlusion, weak texture or background interference. Finally, the detection performance at the pixel level and region level is optimized through multi-scale BCE and IoU loss, cross-scale consistency constraints, and uncertainty-guided supervision mechanisms. Extensive experiments show that DUP-MCRNet outperforms various SOD methods on most common benchmark datasets, especially in terms of edge clarity and robustness to complex backgrounds. Our code is publicly available at https://github.com/YukiBear426/DUP-MCRNet.

LGMay 11, 2025
Non-Stationary Time Series Forecasting Based on Fourier Analysis and Cross Attention Mechanism

Yuqi Xiong, Yang Wen

Time series forecasting has important applications in financial analysis, weather forecasting, and traffic management. However, existing deep learning models are limited in processing non-stationary time series data because they cannot effectively capture the statistical characteristics that change over time. To address this problem, this paper proposes a new framework, AEFIN, which enhances the information sharing ability between stable and unstable components by introducing a cross-attention mechanism, and combines Fourier analysis networks with MLP to deeply explore the seasonal patterns and trend characteristics in unstable components. In addition, we design a new loss function that combines time-domain stability constraints, time-domain instability constraints, and frequency-domain stability constraints to improve the accuracy and robustness of forecasting. Experimental results show that AEFIN outperforms the most common models in terms of mean square error and mean absolute error, especially under non-stationary data conditions, and shows excellent forecasting capabilities. This paper provides an innovative solution for the modeling and forecasting of non-stationary time series data, and contributes to the research of deep learning for complex time series.