CVNov 14, 2025
SOTFormer: A Minimal Transformer for Unified Object Tracking and Trajectory PredictionZhongping Dong, Pengyang Yu, Shuangjian Li et al.
Accurate single-object tracking and short-term motion forecasting remain challenging under occlusion, scale variation, and temporal drift, which disrupt the temporal coherence required for real-time perception. We introduce \textbf{SOTFormer}, a minimal constant-memory temporal transformer that unifies object detection, tracking, and short-horizon trajectory prediction within a single end-to-end framework. Unlike prior models with recurrent or stacked temporal encoders, SOTFormer achieves stable identity propagation through a ground-truth-primed memory and a burn-in anchor loss that explicitly stabilizes initialization. A single lightweight temporal-attention layer refines embeddings across frames, enabling real-time inference with fixed GPU memory. On the Mini-LaSOT (20%) benchmark, SOTFormer attains 76.3 AUC and 53.7 FPS (AMP, 4.3 GB VRAM), outperforming transformer baselines such as TrackFormer and MOTRv2 under fast motion, scale change, and occlusion.
CVMar 29, 2024
HARMamba: Efficient and Lightweight Wearable Sensor Human Activity Recognition Based on Bidirectional MambaShuangjian Li, Tao Zhu, Furong Duan et al.
Wearable sensor-based human activity recognition (HAR) is a critical research domain in activity perception. However, achieving high efficiency and long sequence recognition remains a challenge. Despite the extensive investigation of temporal deep learning models, such as CNNs, RNNs, and transformers, their extensive parameters often pose significant computational and memory constraints, rendering them less suitable for resource-constrained mobile health applications. This study introduces HARMamba, an innovative light-weight and versatile HAR architecture that combines selective bidirectional State Spaces Model and hardware-aware design. To optimize real-time resource consumption in practical scenarios, HARMamba employs linear recursive mechanisms and parameter discretization, allowing it to selectively focus on relevant input sequences while efficiently fusing scan and recompute operations. The model employs independent channels to process sensor data streams, dividing each channel into patches and appending classification tokens to the end of the sequence. It utilizes position embedding to represent the sequence order. The patch sequence is subsequently processed by HARMamba Block, and the classification head finally outputs the activity category. The HARMamba Block serves as the fundamental component of the HARMamba architecture, enabling the effective capture of more discriminative activity sequence features. HARMamba outperforms contemporary state-of-the-art frameworks, delivering comparable or better accuracy with significantly reducing computational and memory demands. It's effectiveness has been extensively validated on 4 publically available datasets namely PAMAP2, WISDM, UNIMIB SHAR and UCI. The F1 scores of HARMamba on the four datasets are 99.74%, 99.20%, 88.23% and 97.01%, respectively.
LGMar 16, 2025
HAR-DoReMi: Optimizing Data Mixture for Self-Supervised Human Activity Recognition Across Heterogeneous IMU DatasetsLulu Ban, Tao Zhu, Xiangqing Lu et al.
Cross-dataset Human Activity Recognition (HAR) suffers from limited model generalization, hindering its practical deployment. To address this critical challenge, inspired by the success of DoReMi in Large Language Models (LLMs), we introduce a data mixture optimization strategy for pre-training HAR models, aiming to improve the recognition performance across heterogeneous datasets. However, directly applying DoReMi to the HAR field encounters new challenges due to the continuous, multi-channel and intrinsic heterogeneous characteristics of IMU sensor data. To overcome these limitations, we propose a novel framework HAR-DoReMi, which introduces a masked reconstruction task based on Mean Squared Error (MSE) loss. By raplacing the discrete language sequence prediction task, which relies on the Negative Log-Likelihood (NLL) loss, in the original DoReMi framework, the proposed framework is inherently more appropriate for handling the continuous and multi-channel characteristics of IMU data. In addition, HAR-DoReMi integrates the Mahony fusion algorithm into the self-supervised HAR pre-training, aiming to mitigate the heterogeneity of varying sensor orientation. This is achieved by estimating the sensor orientation within each dataset and facilitating alignment with a unified coordinate system, thereby improving the cross-dataset generalization ability of the HAR model. Experimental evaluation on multiple cross-dataset HAR transfer tasks demonstrates that HAR-DoReMi improves the accuracy by an average of 6.51%, compared to the current state-of-the-art method with only approximately 30% to 50% of the data usage. These results confirm the effectiveness of HAR-DoReMi in improving the generalization and data efficiency of pre-training HAR models, underscoring its significant potential to facilitate the practical deployment of HAR technology.
CVMar 13, 2024
P2LHAP:Wearable sensor-based human activity recognition, segmentation and forecast through Patch-to-Label Seq2Seq TransformerShuangjian Li, Tao Zhu, Mingxing Nie et al.
Traditional deep learning methods struggle to simultaneously segment, recognize, and forecast human activities from sensor data. This limits their usefulness in many fields such as healthcare and assisted living, where real-time understanding of ongoing and upcoming activities is crucial. This paper introduces P2LHAP, a novel Patch-to-Label Seq2Seq framework that tackles all three tasks in a efficient single-task model. P2LHAP divides sensor data streams into a sequence of "patches", served as input tokens, and outputs a sequence of patch-level activity labels including the predicted future activities. A unique smoothing technique based on surrounding patch labels, is proposed to identify activity boundaries accurately. Additionally, P2LHAP learns patch-level representation by sensor signal channel-independent Transformer encoders and decoders. All channels share embedding and Transformer weights across all sequences. Evaluated on three public datasets, P2LHAP significantly outperforms the state-of-the-art in all three tasks, demonstrating its effectiveness and potential for real-world applications.