LGAug 29, 2024Code
Variational Mode-Driven Graph Convolutional Network for Spatiotemporal Traffic ForecastingOsama Ahmad, Lukas Wesemann, Fabian Waschkowski et al.
This paper focuses on spatiotemporal (ST) traffic prediction using graph neural networks (GNNs). Given that ST data comprises non-stationary and complex temporal patterns, interpreting and predicting such trends is inherently challenging. Representing ST data in decomposed modes helps infer underlying behavior and assess the impact of noise on predictive performance. We propose a framework that decomposes ST data into interpretable modes using variational mode decomposition (VMD) and processes them through a neural network for future state forecasting. Unlike existing graph-based traffic forecasters that operate directly on raw or aggregated time series, the proposed hybrid approach, termed the Variational Mode Graph Convolutional Network (VMGCN), first decomposes non-stationary signals into interpretable variational modes by determining the optimal mode count via reconstruction-loss minimization and then learns both intramode and cross-mode spatiotemporal dependencies through a novel attention-augmented GCN. Additionally, we analyze the significance of each mode and the effect of bandwidth constraints on multi-horizon traffic flow predictions. The proposed two-stage design yields significant accuracy gains while providing frequency-level interpretability with demonstrated superior performance on the LargeST dataset for both short-term and long-term forecasting tasks. The implementation is publicly available on https://github.com/OsamaAhmad369/VMGCN.
CVSep 30, 2025
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language ModelsXinyu Tian, Shu Zou, Zhaoyuan Yang et al.
Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our result model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/
CVOct 2, 2025
Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained PromptingShu Zou, Xinyu Tian, Lukas Wesemann et al.
Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g. violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces towards anomaly and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.
LGAug 31, 2025
Robust Spatiotemporal Forecasting Using Adaptive Deep-Unfolded Variational Mode DecompositionOsama Ahmad, Lukas Wesemann, Fabian Waschkowski et al.
Accurate spatiotemporal forecasting is critical for numerous complex systems but remains challenging due to complex volatility patterns and spectral entanglement in conventional graph neural networks (GNNs). While decomposition-integrated approaches like variational mode graph convolutional network (VMGCN) improve accuracy through signal decomposition, they suffer from computational inefficiency and manual hyperparameter tuning. To address these limitations, we propose the mode adaptive graph network (MAGN) that transforms iterative variational mode decomposition (VMD) into a trainable neural module. Our key innovations include (1) an unfolded VMD (UVMD) module that replaces iterative optimization with a fixed-depth network to reduce the decomposition time (by 250x for the LargeST benchmark), and (2) mode-specific learnable bandwidth constraints (αk ) adapt spatial heterogeneity and eliminate manual tuning while preventing spectral overlap. Evaluated on the LargeST benchmark (6,902 sensors, 241M observations), MAGN achieves an 85-95% reduction in the prediction error over VMGCN and outperforms state-of-the-art baselines.