CV · Mar 4, 2024
DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction
Weiyi Lv, Yuhang Huang, Ning Zhang et al.
In Multiple Object Tracking (MOT), objects often exhibit non-linear motion, accelerating and decelerating with irregular direction changes. Tracking-by-detection (TBD) trackers with Kalman Filter motion prediction work well in pedestrian-dominant scenarios but fall short in complex situations where multiple objects perform non-linear and diverse motion simultaneously. To tackle this complex non-linear motion, we propose a real-time diffusion-based MOT approach named DiffMOT. Specifically, for the motion predictor component, we propose a novel Decoupled Diffusion-based Motion Predictor (D$^2$MP). It models the entire distribution of the various motions present in the data as a whole. It also predicts an individual object's motion conditioned on that object's historical motion information. Furthermore, it optimizes the diffusion process to use far fewer sampling steps. As an MOT tracker, DiffMOT runs in real time at 22.7 FPS and outperforms the state of the art on the DanceTrack and SportsMOT datasets, with HOTA scores of $62.3\%$ and $76.2\%$, respectively. To the best of our knowledge, DiffMOT is the first to introduce a diffusion probabilistic model into MOT to tackle non-linear motion prediction.
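The abstract's point about linear motion prediction can be illustrated with a minimal one-dimensional sketch. This is not DiffMOT or D$^2$MP; it assumes a plain constant-velocity extrapolation, the kind of linear motion assumption often paired with Kalman filters in trackers, and the function names and trajectories are hypothetical. It shows that such a predictor is exact for uniform motion but carries a persistent error once the object accelerates.

```python
def constant_velocity_predict(prev, curr):
    """Extrapolate the next position assuming the velocity stays constant.

    Hypothetical helper for illustration only, not part of DiffMOT.
    """
    return curr + (curr - prev)


def prediction_errors(positions):
    """Absolute one-step-ahead error of the constant-velocity predictor."""
    errors = []
    for t in range(1, len(positions) - 1):
        predicted = constant_velocity_predict(positions[t - 1], positions[t])
        errors.append(abs(positions[t + 1] - predicted))
    return errors


# Assumed toy trajectories with a unit time step.
linear = [2.0 * t for t in range(6)]                    # constant velocity of 2
accelerating = [0.5 * 3.0 * t ** 2 for t in range(6)]   # constant acceleration of 3

linear_errors = prediction_errors(linear)
accel_errors = prediction_errors(accelerating)

print("linear motion errors:      ", linear_errors)
print("accelerating motion errors:", accel_errors)
```

With a unit time step, the error on the accelerating trajectory equals the acceleration at every step, so the linear predictor never catches up. That is the kind of gap a learned non-linear motion predictor is meant to close.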
15.4 · SY · Apr 8
Decision-focused Conservation Voltage Reduction to Consider the Cascading Impact of Forecast Errors
Qintao Du, Ran Li, Weiyi Lv et al.
Conservation Voltage Reduction (CVR) relies on the effective coordination of slow-acting devices, such as OLTCs and CBs, and fast-acting devices, such as SVGs and PV inverters, typically implemented through a hierarchical multi-stage Volt-Var Control (VVC) spanning day-ahead scheduling, intra-day dispatch, and real-time control. However, existing sequential methods fail to account for the cascading impact of forecast errors on multi-stage decision-making. This oversight results in suboptimal day-ahead schedules for OLTCs and CBs that hinder effective coordination with fast-acting SVGs and inverters, inevitably forcing a trade-off between real-time voltage security and CVR efficiency. To improve the Pareto front of this trade-off, this paper proposes a novel bi-level multi-timescale forecasting (Bi-MTF) framework for multi-stage VVC optimization. By integrating the downstream multi-stage VVC optimization into the training of the upstream forecasting models, the decision-focused forecasting models are able to learn the trade-offs across temporal horizons. To solve the computationally challenging bi-level formulation, a modified sensitivity-driven integer L-shaped method is developed. It uses a hybrid gradient feedback mechanism that combines numerical sensitivity analysis for discrete variables with analytical dual information for continuous forecast parameters to ensure tractability. Numerical results on a modified IEEE 33-bus system demonstrate that the proposed approach yields superior energy savings and operational safety compared to conventional MSE-based sequential paradigms. Specifically, as the capacity of fast-acting devices increases, the energy savings of the proposed method rise from 2.74% to 3.41%, compared with 1.50% to 1.76% for conventional MSE-based sequential paradigms.
CV · Nov 21, 2025
Vision-Motion-Reference Alignment for Referring Multi-Object Tracking via Multi-Modal Large Language Models
Weiyi Lv, Ning Zhang, Hanyang Sun et al.
Referring Multi-Object Tracking (RMOT) extends conventional multi-object tracking (MOT) by introducing natural language references for multi-modal fusion tracking. RMOT benchmarks describe only the object's appearance, relative position, and initial motion state. This so-called static regulation fails to capture dynamic changes in object motion, including velocity changes and shifts in motion direction. This limitation not only causes a temporal discrepancy between the static references and the dynamic vision modality but also constrains multi-modal tracking performance. To address this limitation, we propose a novel Vision-Motion-Reference aligned RMOT framework, named VMRMOT. It integrates a motion modality extracted from object dynamics to enhance the alignment between the vision modality and language references through multi-modal large language models (MLLMs). Specifically, we introduce motion-aware descriptions derived from the objects' dynamic behaviors and, leveraging the temporal-reasoning capabilities of MLLMs, extract motion features as the motion modality. We further design a Vision-Motion-Reference Alignment (VMRA) module to hierarchically align visual queries with motion and reference cues, enhancing their cross-modal consistency. In addition, a Motion-Guided Prediction Head (MGPH) is developed to exploit the motion modality and improve the performance of the prediction head. To the best of our knowledge, VMRMOT is the first approach to employ MLLMs in the RMOT task for vision-reference alignment. Extensive experiments on multiple RMOT benchmarks demonstrate that VMRMOT outperforms existing state-of-the-art methods.