Xiaolin Qin

CV
h-index11
21papers
48citations
Novelty52%
AI Score53

21 Papers

CVMar 8, 2022Code
YouTube-GDD: A challenging gun detection dataset with rich contextual information

Yongxiang Gu, Xingbin Liao, Xiaolin Qin

An automatic gun detection system can detect potential gun-related violence at an early stage that is of paramount importance for citizens security. In the whole system, object detection algorithm is the key to perceive the environment so that the system can detect dangerous objects such as pistols and rifles. However, mainstream deep learning-based object detection algorithms depend heavily on large-scale high-quality annotated samples, and the existing gun datasets are characterized by low resolution, little contextual information and little data volume. To promote the development of security, this work presents a new challenging dataset called YouTube Gun Detection Dataset (YouTube-GDD). Our dataset is collected from 343 high-definition YouTube videos and contains 5000 well-chosen images, in which 16064 instances of gun and 9046 instances of person are annotated. Compared to other datasets, YouTube-GDD is "dynamic", containing rich contextual information and recording shape changes of the gun during shooting. To build a baseline for gun detection, we evaluate YOLOv5 on YouTube-GDD and analyze the influence of additional related annotated information on gun detection. YouTube-GDD and subsequent updates will be released at https://github.com/UCAS-GYX/YouTube-GDD.

SYJun 27, 2012
Structural analysis of high-index DAE for process simulation

Xiaolin Qin, Wenyuan Wu, Yong Feng et al.

This paper deals with the structural analysis problem of dynamic lumped process high-index DAE models. We consider two methods for index reduction of such models by differentiation: Pryce's method and the symbolic differential elimination algorithm rifsimp. Discussion and comparison of these methods are given via a class of fundamental process simulation examples. In particular, the efficiency of the Pryce method is illustrated as a function of the number of tanks in process design.

82.7CVApr 14
Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models

Zijian Liu, Sihan Cao, Pengcheng Zheng et al.

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose over-dominance is closely associated with hallucination-prone generation. Motivated by this insight, we propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that rebalances temporal evidence allocation in middle-to-late decoder layers without altering visual encoding or requiring auxiliary models. DTR adaptively calibrates decoder-side visual attention to alleviate temporally imbalanced concentration and encourage under-attended frames to contribute more effectively to response generation. In this way, DTR guides the decoder to ground its outputs in temporally broader and more balanced video evidence. Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across diverse Video-LLM families, while preserving competitive video understanding performance and high inference efficiency.

AO-PHSep 6, 2024
STAA: Spatio-Temporal Alignment Attention for Short-Term Precipitation Forecasting

Min Chen, Hao Yang, Shaohan Li et al.

There is a great need to accurately predict short-term precipitation, which has socioeconomic effects such as agriculture and disaster prevention. Recently, the forecasting models have employed multi-source data as the multi-modality input, thus improving the prediction accuracy. However, the prevailing methods usually suffer from the desynchronization of multi-source variables, the insufficient capability of capturing spatio-temporal dependency, and unsatisfactory performance in predicting extreme precipitation events. To fix these problems, we propose a short-term precipitation forecasting model based on spatio-temporal alignment attention, with SATA as the temporal alignment module and STAU as the spatio-temporal feature extractor to filter high-pass features from precipitation signals and capture multi-term temporal dependencies. Based on satellite and ERA5 data from the southwestern region of China, our model achieves improvements of 12.61\% in terms of RMSE, in comparison with the state-of-the-art methods.

LGJul 23, 2025Code
Met$^2$Net: A Decoupled Two-Stage Spatio-Temporal Forecasting Model for Complex Meteorological Systems

Shaohan Li, Hao Yang, Min Chen et al.

The increasing frequency of extreme weather events due to global climate change urges accurate weather prediction. Recently, great advances have been made by the \textbf{end-to-end methods}, thanks to deep learning techniques, but they face limitations of \textit{representation inconsistency} in multivariable integration and struggle to effectively capture the dependency between variables, which is required in complex weather systems. Treating different variables as distinct modalities and applying a \textbf{two-stage training approach} from multimodal models can partially alleviate this issue, but due to the inconformity in training tasks between the two stages, the results are often suboptimal. To address these challenges, we propose an implicit two-stage training method, configuring separate encoders and decoders for each variable. In detailed, in the first stage, the Translator is frozen while the Encoders and Decoders learn a shared latent space, in the second stage, the Encoders and Decoders are frozen, and the Translator captures inter-variable interactions for prediction. Besides, by introducing a self-attention mechanism for multivariable fusion in the latent space, the performance achieves further improvements. Empirically, extensive experiments show the state-of-the-art performance of our method. Specifically, it reduces the MSE for near-surface air temperature and relative humidity predictions by 28.82\% and 23.39\%, respectively. The source code is available at https://github.com/ShremG/Met2Net.

48.5CVApr 27
Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures

Qianlei Wang, Kexun Chen, Shaolin Zhang et al.

Monocular depth estimation (MDE) has witnessed remarkable progress driven by Convolutional Neural Networks and transformer-based architectures. However, these approaches typically treat the problem as a generic image-to-image regression on Euclidean grids, thereby overlooking the intrinsic algebraic and geometric structures induced by perspective projection. To address this limitation, we propose LAGRNet, a novel framework that fundamentally grounds MDE in algebraic geometry by explicitly embedding learnable group, ring, and sheaf structures into the deep learning pipeline. Modeling feature maps as sections of a sheaf over an approximated image manifold, our method first establishes a Group-defined Feature Manifold (GFM) parameterized by a learned algebraic group action to enforce projective equivariance and robustness against view changes. To facilitate algebraically consistent cross-scale interactions, we subsequently introduce a Ring Convolution Layer (RCL) that formulates feature fusion as a graded ring homomorphism. Furthermore, to ensure global topological consistency, a Sheaf-based Module (SM) aggregates local depth cues via Čech nerve on the image topology. Extensive zero-shot evaluations across the KITTI, NYU-Depth V2, and ETH3D benchmarks demonstrate that LAGRNet significantly outperforms state-of-the-art methods in both accuracy and generalization capabilities.

49.4CVApr 27
Unconstrained Multi-view Human Pose Estimation with Algebraic Priors

Xiaolin Qin, Qianlei Wang, Jiacen Liu et al.

Recovering 3D human pose from multi-view imagery typically relies on precise camera calibration, which is often unavailable in real-world scenarios, thereby severely limiting the applicability of existing methods. To overcome this challenge, we propose an unconstrained framework that synergizes deep neural networks, algebraic priors, and temporal dynamics for uncalibrated multi-view human pose estimation. First, we introduce the Triangulation with Transformer Regressor (TTR), which reformulates classical triangulation into a data-driven token fusion process to bypass the dependency on explicit camera parameters. Second, to explicitly embed the inherent algebraic relations of the multi-view variety into the learning process, we propose the Gröbner basis Corrector (GC). This pioneering loss formulation enforces constraints derived from the multi-view variety to ensure the neural predictions strictly adhere to the laws of projective geometry. Finally, we devise the Temporal Equivariant Rectifier (TER), which exploits the equivariance property of human motion to impose temporal coherence and structural consistency, effectively mitigating scale ambiguity in uncalibrated settings. Extensive evaluations on standard benchmarks demonstrate that our framework establishes a new state-of-the-art for uncalibrated multi-view human pose estimation. Notably, our approach significantly closes the performance gap between calibration-free methods and fully calibrated oracles.

LGFeb 17, 2025
CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning

Yanxiao Zhao, Yangge Qian, Jingyang Shan et al.

Reinforcement learning (RL) in continuous action spaces encounters persistent challenges, such as inefficient exploration and convergence to suboptimal solutions. To address these limitations, we propose CAMEL, a novel framework integrating LLM-generated suboptimal policies into the RL training pipeline. CAMEL leverages dynamic action masking and an adaptive epsilon-masking mechanism to guide exploration during early training stages while gradually enabling agents to optimize policies independently. At the core of CAMEL lies the integration of Python-executable suboptimal policies generated by LLMs based on environment descriptions and task objectives. Although simplistic and hard-coded, these policies offer valuable initial guidance for RL agents. To effectively utilize these priors, CAMEL employs masking-aware optimization to dynamically constrain the action space based on LLM outputs. Additionally, epsilon-masking gradually reduces reliance on LLM-generated guidance, enabling agents to transition from constrained exploration to autonomous policy refinement. Experimental validation on Gymnasium MuJoCo environments demonstrates the effectiveness of CAMEL. In Hopper-v4 and Ant-v4, LLM-generated policies significantly improve sample efficiency, achieving performance comparable to or surpassing expert masking baselines. For Walker2d-v4, where LLMs struggle to accurately model bipedal gait dynamics, CAMEL maintains robust RL performance without notable degradation, highlighting the framework's adaptability across diverse tasks. While CAMEL shows promise in enhancing sample efficiency and mitigating convergence challenges, these issues remain open for further research. Future work aims to generalize CAMEL to multimodal LLMs for broader observation-action spaces and automate policy evaluation, reducing human intervention and enhancing scalability in RL training pipelines.

CVNov 18, 2025
Interaction-Aware 4D Gaussian Splatting for Dynamic Hand-Object Interaction Reconstruction

Hao Tian, Chenyangguang Zhang, Rui Liu et al.

This paper focuses on a challenging setting of simultaneously modeling geometry and appearance of hand-object interaction scenes without any object priors. We follow the trend of dynamic 3D Gaussian Splatting based methods, and address several significant challenges. To model complex hand-object interaction with mutual occlusion and edge blur, we present interaction-aware hand-object Gaussians with newly introduced optimizable parameters aiming to adopt piecewise linear hypothesis for clearer structural representation. Moreover, considering the complementarity and tightness of hand shape and object shape during interaction dynamics, we incorporate hand information into object deformation field, constructing interaction-aware dynamic fields to model flexible motions. To further address difficulties in the optimization process, we propose a progressive strategy that handles dynamic regions and static background step by step. Correspondingly, explicit regularizations are designed to stabilize the hand-object representations for smooth motion transition, physical interaction reality, and coherent lighting. Experiments show that our approach surpasses existing dynamic 3D-GS-based methods and achieves state-of-the-art performance in reconstructing dynamic hand-object interaction.

CVSep 2, 2025
A Multimodal Cross-View Model for Predicting Postoperative Neck Pain in Cervical Spondylosis Patients

Jingyang Shan, Qishuai Yu, Jiacen Liu et al.

Neck pain is the primary symptom of cervical spondylosis, yet its underlying mechanisms remain unclear, leading to uncertain treatment outcomes. To address the challenges of multimodal feature fusion caused by imaging differences and spatial mismatches, this paper proposes an Adaptive Bidirectional Pyramid Difference Convolution (ABPDC) module that facilitates multimodal integration by exploiting the advantages of difference convolution in texture extraction and grayscale invariance, and a Feature Pyramid Registration Auxiliary Network (FPRAN) to mitigate structural misalignment. Experiments on the MMCSD dataset demonstrate that the proposed model achieves superior prediction accuracy of postoperative neck pain recovery compared with existing methods, and ablation studies further confirm its effectiveness.

AIAug 31, 2025
SATQuest: A Verifier for Logical Reasoning Evaluation and Reinforcement Fine-Tuning of LLMs

Yanxiao Zhao, Yaqian Li, Zihao Bo et al.

Recent advances in Large Language Models (LLMs) have demonstrated remarkable general reasoning capabilities. However, systematically evaluating and enhancing these reasoning capabilities is challenging due to the lack of controllable and scalable tools for fine-grained analysis. Existing benchmarks and datasets often lack the necessary variable control for multi-dimensional, systematic analysis and training, or have narrow problem types and formats. To address these limitations, we introduce SATQuest, a systematic verifier designed to evaluate and enhance logical reasoning in LLMs by generating diverse, Satisfiability-based logical reasoning problems directly from Conjunctive Normal Form (CNF) instances. SATQuest structures these problems along three orthogonal dimensions: instance scale, problem type, and question format, employing randomized, SAT-based problem generation and objective answer verification via PySAT. This design mitigates memorization issues, allows for nuanced insights into reasoning performance, and enables effective reinforcement fine-tuning. Our extensive evaluation of various LLMs using SATQuest identified significant limitations in their logical reasoning, particularly in generalizing beyond familiar mathematical formats. Furthermore, we show that reinforcement fine-tuning with SATQuest rewards substantially improves targeted task performance and generalizes to more complex instances, while highlighting remaining challenges in cross-format adaptation. Through these demonstrations, we showcase SATQuest's potential as a foundational tool and a valuable starting point for advancing LLM logical reasoning.

CVJul 23, 2025
SRMambaV2: Biomimetic Attention for Sparse Point Cloud Upsampling in Autonomous Driving

Chuang Chen, Xiaolin Qin, Jing Hu et al.

Upsampling LiDAR point clouds in autonomous driving scenarios remains a significant challenge due to the inherent sparsity and complex 3D structures of the data. Recent studies have attempted to address this problem by converting the complex 3D spatial scenes into 2D image super-resolution tasks. However, due to the sparse and blurry feature representation of range images, accurately reconstructing detailed and complex spatial topologies remains a major difficulty. To tackle this, we propose a novel sparse point cloud upsampling method named SRMambaV2, which enhances the upsampling accuracy in long-range sparse regions while preserving the overall geometric reconstruction quality. Specifically, inspired by human driver visual perception, we design a biomimetic 2D selective scanning self-attention (2DSSA) mechanism to model the feature distribution in distant sparse areas. Meanwhile, we introduce a dual-branch network architecture to enhance the representation of sparse features. In addition, we introduce a progressive adaptive loss (PAL) function to further refine the reconstruction of fine-grained details during the upsampling process. Experimental results demonstrate that SRMambaV2 achieves superior performance in both qualitative and quantitative evaluations, highlighting its effectiveness and practical value in automotive sparse point cloud upsampling tasks.

CVApr 29, 2025
EfficientHuman: Efficient Training and Reconstruction of Moving Human using Articulated 2D Gaussian

Hao Tian, Rui Liu, Wen Shen et al.

3D Gaussian Splatting (3DGS) has been recognized as a pioneering technique in scene reconstruction and novel view synthesis. Recent work on reconstructing the 3D human body using 3DGS attempts to leverage prior information on human pose to enhance rendering quality and improve training speed. However, it struggles to effectively fit dynamic surface planes due to multi-view inconsistency and redundant Gaussians. This inconsistency arises because Gaussian ellipsoids cannot accurately represent the surfaces of dynamic objects, which hinders the rapid reconstruction of the dynamic human body. Meanwhile, the prevalence of redundant Gaussians means that the training time of these works is still not ideal for quickly fitting a dynamic human body. To address these, we propose EfficientHuman, a model that quickly accomplishes the dynamic reconstruction of the human body using Articulated 2D Gaussian while ensuring high rendering quality. The key innovation involves encoding Gaussian splats as Articulated 2D Gaussian surfels in canonical space and then transforming them to pose space via Linear Blend Skinning (LBS) to achieve efficient pose transformations. Unlike 3D Gaussians, Articulated 2D Gaussian surfels can quickly conform to the dynamic human body while ensuring view-consistent geometries. Additionally, we introduce a pose calibration module and an LBS optimization module to achieve precise fitting of dynamic human poses, enhancing the model's performance. Extensive experiments on the ZJU-MoCap dataset demonstrate that EfficientHuman achieves rapid 3D dynamic human reconstruction in less than a minute on average, which is 20 seconds faster than the current state-of-the-art method, while also reducing the number of redundant Gaussians.

CVMay 15, 2024
Fourier Boundary Features Network with Wider Catchers for Glass Segmentation

Xiaolin Qin, Jiacen Liu, Qianlei Wang et al.

Glass largely blurs the boundary between the real world and the reflection. The special transmittance and reflectance quality have confused the semantic tasks related to machine vision. Therefore, how to clear the boundary built by glass, and avoid over-capturing features as false positive information in deep structure, matters for constraining the segmentation of reflection surface and penetrating glass. We proposed the Fourier Boundary Features Network with Wider Catchers (FBWC), which might be the first attempt to utilize sufficiently wide horizontal shallow branches without vertical deepening for guiding the fine granularity segmentation boundary through primary glass semantic information. Specifically, we designed the Wider Coarse-Catchers (WCC) for anchoring large area segmentation and reducing excessive extraction from a structural perspective. We embed fine-grained features by Cross Transpose Attention (CTA), which is introduced to avoid the incomplete area within the boundary caused by reflection noise. For excavating glass features and balancing high-low layers context, a learnable Fourier Convolution Controller (FCC) is proposed to regulate information integration robustly. The proposed method has been validated on three different public glass segmentation datasets. Experimental results reveal that the proposed method yields better segmentation performance compared with the state-of-the-art (SOTA) methods in glass image segmentation.

LGMar 1, 2024
Snapshot Reinforcement Learning: Leveraging Prior Trajectories for Efficiency

Yanxiao Zhao, Yangge Qian, Tianyi Wang et al.

Deep reinforcement learning (DRL) algorithms require substantial samples and computational resources to achieve higher performance, which restricts their practical application and poses challenges for further development. Given the constraint of limited resources, it is essential to leverage existing computational work (e.g., learned policies, samples) to enhance sample efficiency and reduce the computational resource consumption of DRL algorithms. Previous works to leverage existing computational work require intrusive modifications to existing algorithms and models, designed specifically for specific algorithms, lacking flexibility and universality. In this paper, we present the Snapshot Reinforcement Learning (SnapshotRL) framework, which enhances sample efficiency by simply altering environments, without making any modifications to algorithms and models. By allowing student agents to choose states in teacher trajectories as the initial state to sample, SnapshotRL can effectively utilize teacher trajectories to assist student agents in training, allowing student agents to explore a larger state space at the early training phase. We propose a simple and effective SnapshotRL baseline algorithm, S3RL, which integrates well with existing DRL algorithms. Our experiments demonstrate that integrating S3RL with TD3, SAC, and PPO algorithms on the MuJoCo benchmark significantly improves sample efficiency and average return, without extra samples and additional computational resources.

CVJul 30, 2021
Real-time Streaming Perception System for Autonomous Driving

Yongxiang Gu, Qianlei Wang, Xiaolin Qin

Nowadays, plenty of deep learning technologies are being applied to all aspects of autonomous driving with promising results. Among them, object detection is the key to improve the ability of an autonomous agent to perceive its environment so that it can (re)act. However, previous vision-based object detectors cannot achieve satisfactory performance under real-time driving scenarios. To remedy this, we present the real-time steaming perception system in this paper, which is also the 2nd Place solution of Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) for the detection-only track. Unlike traditional object detection challenges, which focus mainly on the absolute performance, streaming perception task requires achieving a balance of accuracy and latency, which is crucial for real-time autonomous driving. We adopt YOLOv5 as our basic framework, data augmentation, Bag-of-Freebies, and Transformer are adopted to improve streaming object detection performance with negligible extra inference cost. On the Argoverse-HD test set, our method achieves 33.2 streaming AP (34.6 streaming AP verified by the organizer) under the required hardware. Its performance significantly surpasses the fixed baseline of 13.6 (host team), demonstrating the potentiality of application.

CVMay 20, 2021
Content-Augmented Feature Pyramid Network with Light Linear Spatial Transformers for Object Detection

Yongxiang Gu, Xiaolin Qin, Yuncong Peng et al.

As one of the prevalent components, Feature Pyramid Network (FPN) is widely used in current object detection models for improving multi-scale object detection performance. However, its feature fusion mode is still in a misaligned and local manner, thus limiting the representation power. To address the inherit defects of FPN, a novel architecture termed Content-Augmented Feature Pyramid Network (CA-FPN) is proposed in this paper. Firstly, a Global Content Extraction Module (GCEM) is proposed to extract multi-scale context information. Secondly, lightweight linear spatial Transformer connections are added in the top-down pathway to augment each feature map with multi-scale features, where a linearized approximate self-attention function is designed for reducing model complexity. By means of the self-attention mechanism in Transformer, there is no longer need to align feature maps during feature fusion, thus solving the misaligned defect. By setting the query scope to the entire feature map, the local defect can also be solved. Extensive experiments on COCO and PASCAL VOC datasets demonstrated that our CA-FPN outperforms other FPN-based detectors without bells and whistles and is robust in different settings.

CVJul 13, 2018
Single Bitmap Block Truncation Coding of Color Images Using Hill Climbing Algorithm

Lige Zhang, Xiaolin Qin, Qing Li et al.

Recently, the use of digital images in various fields is increasing rapidly. To increase the number of images stored and get faster transmission of them, it is necessary to reduce the size of these images. Single bitmap block truncation coding (SBBTC) schemes are compression techniques, which are used to generate a common bitmap to quantize the R, G and B planes in color image. As one of the traditional SBBTC schemes, weighted plane (W-plane) method is famous for its simplicity and low time consumption. However, the W-plane method also has poor performance in visual quality. This paper proposes an improved SBBTC scheme based on W-plane method using parallel computing and hill climbing algorithm. Compared with various schemes, the simulation results of the proposed scheme are better than that of the reference schemes in visual quality and time consumption.

NAJun 12, 2015
Efficient algorithm for computing large scale systems of differential algebraic equations

Xiaolin Qin, Juan Tang, Yong Feng et al.

In many mathematical models of physical phenomenons and engineering fields, such as electrical circuits or mechanical multibody systems, which generate the differential algebraic equations (DAEs) systems naturally. In general, the feature of DAEs is a sparse large scale system of fully nonlinear and high index. To make use of its sparsity, this paper provides a simple and efficient algorithm for computing the large scale DAEs system. We exploit the shortest augmenting path algorithm for finding maximum value transversal (MVT) as well as block triangular forms (BTF). We also present the extended signature matrix method with the block fixed point iteration and its complexity results. Furthermore, a range of nontrivial problems are demonstrated by our algorithm.

NADec 19, 2014
Structural index reduction algorithms for differential algebraic equations via fixed-point iteration

Juan Tang, Wenyuan Wu, Xiaolin Qin et al.

Motivated by Pryce's structural index reduction method for differential algebraic equations (DAEs), we show the complexity of the fixed-point iteration algorithm and propose a fixed-point iteration method with parameters. It leads to a block fixed-point iteration method which can be applied to large-scale DAEs with block upper triangular structure. Moreover, its complexity analysis is also given in this paper.

SCJan 18, 2010
Parallel computation of real solving bivariate polynomial systems by zero-matching method

Xiaolin Qin, Yong Feng, Jingwei Chen et al.

We present a new algorithm for solving the real roots of a bivariate polynomial system $Σ=\{f(x,y),g(x,y)\}$ with a finite number of solutions by using a zero-matching method. The method is based on a lower bound for bivariate polynomial system when the system is non-zero. Moreover, the multiplicities of the roots of $Σ=0$ can be obtained by a given neighborhood. From this approach, the parallelization of the method arises naturally. By using a multidimensional matching method this principle can be generalized to the multivariate equation systems.