CVMay 27Code
Dual-branch Distilled Transformer for Efficient Asymmetric UAV TrackingHongtao Yang, Bineng Zhong, Qihua Liang et al.
Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives of knowledge transfer: spatially focused feature-level distillation that compensates for weakened representations by guiding the student to learn strong target representations, and prediction-level distillation that enhances spatial localization by learning the teacher's capability for accurate target localization. Furthermore, to enhance robustness against appearance variations, we introduce a fine-grained target-aware distillation strategy that selectively transfers the teacher's target modeling capacity to the student. A temporal adaptation module is incorporated at inference to enhance robustness over time. Experiments on five UAV benchmarks demonstrate that EATrack achieves a favorable balance between accuracy and speed. Code: https://github.com/GXNU-ZhongLab/EATrack
CVDec 20, 2024Code
Exploiting Multimodal Spatial-temporal Patterns for Video Object TrackingXiantao Hu, Ying Tai, Xu Zhao et al.
Multimodal tracking has garnered widespread attention as a result of its ability to effectively address the inherent limitations of traditional RGB tracking. However, existing multimodal trackers mainly focus on the fusion and enhancement of spatial features or merely leverage the sparse temporal relationships between video frames. These approaches do not fully exploit the temporal correlations in multimodal videos, making it difficult to capture the dynamic changes and motion information of targets in complex scenarios. To alleviate this problem, we propose a unified multimodal spatial-temporal tracking approach named STTrack. In contrast to previous paradigms that solely relied on updating reference information, we introduced a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target. Furthermore, at the spatial level, we introduced the mamba fusion and background suppression interactive (BSI) modules. These modules establish a dual-stage mechanism for coordinating information interaction and fusion between modalities. Extensive comparisons on five benchmark datasets illustrate that STTrack achieves state-of-the-art performance across various multimodal tracking scenarios. Code is available at: https://github.com/NJU-PCALab/STTrack.
CVFeb 24, 2025Code
SwimVG: Step-wise Multimodal Fusion and Adaption for Visual GroundingLiangtao Shi, Ting Liu, Xiantao Hu et al.
Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. Most existing methods transfer visual/linguistic knowledge separately by fully fine-tuning uni-modal pre-trained models, followed by a simple stack of visual-language transformers for multimodal fusion. However, these approaches not only limit adequate interaction between visual and linguistic contexts, but also incur significant computational costs. Therefore, to address these issues, we explore a step-wise multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG proposes step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding, replacing the cumbersome transformer stacks for multimodal fusion. Swip can improve {the} alignment between the vision and language representations step by step, in a token-level fusion manner. In addition, weight-level CIA further promotes multimodal fusion by cross-modal interaction. Swip and CIA are both parameter-efficient paradigms, and they fuse the cross-modal features from shallow to deep layers gradually. Experimental results on four widely-used benchmarks demonstrate that SwimVG achieves remarkable abilities and considerable benefits in terms of efficiency. Our code is available at https://github.com/liuting20/SwimVG.
CVSep 21, 2025Code
When Color-Space Decoupling Meets Diffusion for Adverse-Weather Image RestorationWenxuan Fang, Jili Fan, Chao Wang et al.
Adverse Weather Image Restoration (AWIR) is a highly challenging task due to the unpredictable and dynamic nature of weather-related degradations. Traditional task-specific methods often fail to generalize to unseen or complex degradation types, while recent prompt-learning approaches depend heavily on the degradation estimation capabilities of vision-language models, resulting in inconsistent restorations. In this paper, we propose \textbf{LCDiff}, a novel framework comprising two key components: \textit{Lumina-Chroma Decomposition Network} (LCDN) and \textit{Lumina-Guided Diffusion Model} (LGDM). LCDN processes degraded images in the YCbCr color space, separately handling degradation-related luminance and degradation-invariant chrominance components. This decomposition effectively mitigates weather-induced degradation while preserving color fidelity. To further enhance restoration quality, LGDM leverages degradation-related luminance information as a guiding condition, eliminating the need for explicit degradation prompts. Additionally, LGDM incorporates a \textit{Dynamic Time Step Loss} to optimize the denoising network, ensuring a balanced recovery of both low- and high-frequency features in the image. Finally, we present DriveWeather, a comprehensive all-weather driving dataset designed to enable robust evaluation. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods, setting a new benchmark in AWIR. The dataset and code are available at: https://github.com/fiwy0527/LCDiff.
CVJul 22, 2025Code
Explicit Context Reasoning with Supervision for Visual TrackingFansheng Zeng, Bineng Zhong, Haiying Xia et al.
Contextual reasoning with constraints is crucial for enhancing temporal consistency in cross-frame modeling for visual tracking. However, mainstream tracking algorithms typically associate context by merely stacking historical information without explicitly supervising the association process, making it difficult to effectively model the target's evolving dynamics. To alleviate this problem, we propose RSTrack, which explicitly models and supervises context reasoning via three core mechanisms. \textit{1) Context Reasoning Mechanism}: Constructs a target state reasoning pipeline, converting unconstrained contextual associations into a temporal reasoning process that predicts the current representation based on historical target states, thereby enhancing temporal consistency. \textit{2) Forward Supervision Strategy}: Utilizes true target features as anchors to constrain the reasoning pipeline, guiding the predicted output toward the true target distribution and suppressing drift in the context reasoning process. \textit{3) Efficient State Modeling}: Employs a compression-reconstruction mechanism to extract the core features of the target, removing redundant information across frames and preventing ineffective contextual associations. These three mechanisms collaborate to effectively alleviate the issue of contextual association divergence in traditional temporal modeling. Experimental results show that RSTrack achieves state-of-the-art performance on multiple benchmark datasets while maintaining real-time running speeds. Our code is available at https://github.com/GXNU-ZhongLab/RSTrack.
CVNov 22, 2025Code
CADTrack: Learning Contextual Aggregation with Deformable Alignment for Robust RGBT TrackingHao Li, Yuhao Wang, Xiantao Hu et al.
RGB-Thermal (RGBT) tracking aims to exploit visible and thermal infrared modalities for robust all-weather object tracking. However, existing RGBT trackers struggle to resolve modality discrepancies, which poses great challenges for robust feature representation. This limitation hinders effective cross-modal information propagation and fusion, which significantly reduces the tracking accuracy. To address this limitation, we propose a novel Contextual Aggregation with Deformable Alignment framework called CADTrack for RGBT Tracking. To be specific, we first deploy the Mamba-based Feature Interaction (MFI) that establishes efficient feature interaction via state space models. This interaction module can operate with linear complexity, reducing computational cost and improving feature discrimination. Then, we propose the Contextual Aggregation Module (CAM) that dynamically activates backbone layers through sparse gating based on the Mixture-of-Experts (MoE). This module can encode complementary contextual information from cross-layer features. Finally, we propose the Deformable Alignment Module (DAM) to integrate deformable sampling and temporal propagation, mitigating spatial misalignment and localization drift. With the above components, our CADTrack achieves robust and accurate tracking in complex scenarios. Extensive experiments on five RGBT tracking benchmarks verify the effectiveness of our proposed method. The source code is released at https://github.com/IdolLab/CADTrack.
CVFeb 10, 2025
Adaptive Perception for Unified Visual Multi-modal Object TrackingXiantao Hu, Bineng Zhong, Qihua Liang et al.
Recently, many multi-modal trackers prioritize RGB as the dominant modality, treating other modalities as auxiliary, and fine-tuning separately various multi-modal tasks. This imbalance in modality dependence limits the ability of methods to dynamically utilize complementary information from each modality in complex scenarios, making it challenging to fully perceive the advantages of multi-modal. As a result, a unified parameter model often underperforms in various multi-modal tracking tasks. To address this issue, we propose APTrack, a novel unified tracker designed for multi-modal adaptive perception. Unlike previous methods, APTrack explores a unified representation through an equal modeling strategy. This strategy allows the model to dynamically adapt to various modalities and tasks without requiring additional fine-tuning between different tasks. Moreover, our tracker integrates an adaptive modality interaction (AMI) module that efficiently bridges cross-modality interactions by generating learnable tokens. Experiments conducted on five diverse multi-modal datasets (RGBT234, LasHeR, VisEvent, DepthTrack, and VOT-RGBD2022) demonstrate that APTrack not only surpasses existing state-of-the-art unified multi-modal trackers but also outperforms trackers designed for specific multi-modal tasks.
CVJun 19, 2025
Learning Multi-scale Spatial-frequency Features for Image DenoisingXu Zhao, Chen Zhao, Xiantao Hu et al.
Recent advancements in multi-scale architectures have demonstrated exceptional performance in image denoising tasks. However, existing architectures mainly depends on a fixed single-input single-output Unet architecture, ignoring the multi-scale representations of pixel level. In addition, previous methods treat the frequency domain uniformly, ignoring the different characteristics of high-frequency and low-frequency noise. In this paper, we propose a novel multi-scale adaptive dual-domain network (MADNet) for image denoising. We use image pyramid inputs to restore noise-free results from low-resolution images. In order to realize the interaction of high-frequency and low-frequency information, we design an adaptive spatial-frequency learning unit (ASFU), where a learnable mask is used to separate the information into high-frequency and low-frequency components. In the skip connections, we design a global feature fusion block to enhance the features at different scales. Extensive experiments on both synthetic and real noisy image datasets verify the effectiveness of MADNet compared with current state-of-the-art denoising approaches.
CVApr 21, 2025
Fast Adversarial Training with Weak-to-Strong Spatial-Temporal Consistency in the Frequency Domain on VideosSongping Wang, Hanqing Liu, Yueming Lyu et al.
Adversarial Training (AT) has been shown to significantly enhance adversarial robustness via a min-max optimization approach. However, its effectiveness in video recognition tasks is hampered by two main challenges. First, fast adversarial training for video models remains largely unexplored, which severely impedes its practical applications. Specifically, most video adversarial training methods are computationally costly, with long training times and high expenses. Second, existing methods struggle with the trade-off between clean accuracy and adversarial robustness. To address these challenges, we introduce Video Fast Adversarial Training with Weak-to-Strong consistency (VFAT-WS), the first fast adversarial training method for video data. Specifically, VFAT-WS incorporates the following key designs: First, it integrates a straightforward yet effective temporal frequency augmentation (TF-AUG), and its spatial-temporal enhanced form STF-AUG, along with a single-step PGD attack to boost training efficiency and robustness. Second, it devises a weak-to-strong spatial-temporal consistency regularization, which seamlessly integrates the simpler TF-AUG and the more complex STF-AUG. Leveraging the consistency regularization, it steers the learning process from simple to complex augmentations. Both of them work together to achieve a better trade-off between clean accuracy and robustness. Extensive experiments on UCF-101 and HMDB-51 with both CNN and Transformer-based models demonstrate that VFAT-WS achieves great improvements in adversarial robustness and corruption robustness, while accelerating training by nearly 490%.
CVJun 11, 2025
An Effective End-to-End Solution for Multimodal Action RecognitionSongping Wang, Xiantao Hu, Yueming Lyu et al.
Recently, multimodal tasks have strongly advanced the field of action recognition with their rich multimodal information. However, due to the scarcity of tri-modal data, research on tri-modal action recognition tasks faces many challenges. To this end, we have proposed a comprehensive multimodal action recognition solution that effectively utilizes multimodal information. First, the existing data are transformed and expanded by optimizing data enhancement techniques to enlarge the training scale. At the same time, more RGB datasets are used to pre-train the backbone network, which is better adapted to the new task by means of transfer learning. Secondly, multimodal spatial features are extracted with the help of 2D CNNs and combined with the Temporal Shift Module (TSM) to achieve multimodal spatial-temporal feature extraction comparable to 3D CNNs and improve the computational efficiency. In addition, common prediction enhancement methods, such as Stochastic Weight Averaging (SWA), Ensemble and Test-Time augmentation (TTA), are used to integrate the knowledge of models from different training periods of the same architecture and different architectures, so as to predict the actions from different perspectives and fully exploit the target information. Ultimately, we achieved the Top-1 accuracy of 99% and the Top-5 accuracy of 100% on the competition leaderboard, demonstrating the superiority of our solution.
LGJun 24, 2024
Personalized federated learning based on feature fusionWolong Xing, Zhenkui Shi, Hongyan Peng et al.
Federated learning enables distributed clients to collaborate on training while storing their data locally to protect client privacy. However, due to the heterogeneity of data, models, and devices, the final global model may need to perform better for tasks on each client. Communication bottlenecks, data heterogeneity, and model heterogeneity have been common challenges in federated learning. In this work, we considered a label distribution skew problem, a type of data heterogeneity easily overlooked. In the context of classification, we propose a personalized federated learning approach called pFedPM. In our process, we replace traditional gradient uploading with feature uploading, which helps reduce communication costs and allows for heterogeneous client models. These feature representations play a role in preserving privacy to some extent. We use a hyperparameter $a$ to mix local and global features, which enables us to control the degree of personalization. We also introduced a relation network as an additional decision layer, which provides a non-linear learnable classifier to predict labels. Experimental results show that, with an appropriate setting of $a$, our scheme outperforms several recent FL methods on MNIST, FEMNIST, and CRIFAR10 datasets and achieves fewer communications.