LG · Oct 8, 2022 · Code
Meta-DMoE: Adapting to Domain Shift by Meta-Distillation from Mixture-of-Experts
Tao Zhong, Zhixiang Chi, Li Gu et al. · princeton
In this paper, we tackle the problem of domain shift. Most existing methods perform training on multiple source domains using a single model, and the same trained model is used on all unseen target domains. Such solutions are sub-optimal because each target domain exhibits its own specialty, to which the model is never adapted. Furthermore, expecting single-model training to learn extensive knowledge from multiple source domains is counterintuitive. The model is more biased toward learning only domain-invariant features, which may result in negative knowledge transfer. In this work, we propose a novel framework for unsupervised test-time adaptation, which is formulated as a knowledge distillation process to address domain shift. Specifically, we incorporate Mixture-of-Experts (MoE) as teachers, where each expert is separately trained on a different source domain to maximize its specialty. Given a test-time target domain, a small set of unlabeled data is sampled to query the knowledge from the MoE. As the source domains are correlated with the target domains, a transformer-based aggregator then combines the domain knowledge by examining the interconnection among them. The output is treated as a supervision signal to adapt a student prediction network toward the target domain. We further employ meta-learning to encourage the aggregator to distill positive knowledge and the student network to achieve fast adaptation. Extensive experiments demonstrate that the proposed method outperforms the state of the art and validate the effectiveness of each proposed component. Our code is available at https://github.com/n3il666/Meta-DMoE.
CV · Aug 31, 2023 · Code
Illumination Distillation Framework for Nighttime Person Re-Identification and A New Benchmark
Andong Lu, Zhang Zhang, Yan Huang et al.
Nighttime person Re-ID (person re-identification in the nighttime) is an important and challenging task for visual surveillance, but it has not been thoroughly investigated. Under low illumination conditions, the performance of person Re-ID methods usually deteriorates sharply. To address the low illumination challenge in nighttime person Re-ID, this paper proposes an Illumination Distillation Framework (IDF), which utilizes illumination enhancement and illumination distillation schemes to promote the learning of Re-ID models. Specifically, IDF consists of a master branch, an illumination enhancement branch, and an illumination distillation module. The master branch is used to extract the features from a nighttime image. The illumination enhancement branch first estimates an enhanced image from the nighttime image using a nonlinear curve mapping method and then extracts the enhanced features. However, nighttime and enhanced features usually contain data noise due to unstable lighting conditions and enhancement failures. To fully exploit the complementary benefits of nighttime and enhanced features while suppressing data noise, we propose an illumination distillation module. In particular, the illumination distillation module fuses the features from the two branches through a bottleneck fusion model and then uses the fused features to guide the learning of both branches in a distillation manner. In addition, we build a real-world nighttime person Re-ID dataset, named Night600, which contains 600 identities captured from different viewpoints and nighttime illumination conditions under complex outdoor environments. Experimental results demonstrate that our IDF can achieve state-of-the-art performance on two nighttime person Re-ID datasets (i.e., Night600 and Knight). We will release our code and dataset at https://github.com/Alexadlu/IDF.
CV · Aug 8, 2023 · Code
SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition
Xiao Wang, Yao Rong, Zongzhen Wu et al.
Event camera-based pattern recognition has emerged as a research topic in recent years. Current researchers usually transform the event streams into images, graphs, or voxels, and adopt deep neural networks for event-based classification. Although good performance can be achieved on simple event recognition datasets, their results may still be limited due to the following two issues. Firstly, they adopt spatially sparse event streams for recognition only, which may fail to capture the color and detailed texture information well. Secondly, they adopt either Spiking Neural Networks (SNN) for energy-efficient recognition with suboptimal results, or Artificial Neural Networks (ANN) for energy-intensive, high-performance recognition. However, few of them consider achieving a balance between these two aspects. In this paper, we formally propose to recognize patterns by fusing RGB frames and event streams simultaneously and propose a new RGB frame-event recognition framework to address the aforementioned issues. The proposed method contains four main modules, i.e., a memory support Transformer network for RGB frame encoding, a spiking neural network for raw event stream encoding, a multi-modal bottleneck fusion module for RGB-Event feature aggregation, and a prediction head. Due to the scarcity of RGB-Event based classification datasets, we also propose a large-scale PokerEvent dataset which contains 114 classes and 27102 frame-event pairs recorded using a DVS346 event camera. Extensive experiments on two RGB-Event based classification datasets fully validated the effectiveness of our proposed framework. We hope this work will boost the development of pattern recognition by fusing RGB frames and event streams. Both our dataset and source code of this work will be released at https://github.com/Event-AHU/SSTFormer
CV · Sep 26, 2023 · Code
Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline
Xiao Wang, Shiao Wang, Chuanming Tang et al.
Tracking using bio-inspired event cameras has drawn more and more attention in recent years. Existing works either utilize aligned RGB and event data for accurate tracking or directly learn an event-based tracker. The first category incurs a higher inference cost, and the second may be easily influenced by noisy events or sparse spatial resolution. In this paper, we propose a novel hierarchical knowledge distillation framework that can fully utilize multi-modal / multi-view information during training to facilitate knowledge transfer, enabling us to achieve high-speed and low-latency visual tracking during testing by using only event signals. Specifically, a teacher Transformer-based multi-modal tracking framework is first trained by feeding the RGB frame and event stream simultaneously. Then, we design a new hierarchical knowledge distillation strategy which includes pairwise similarity, feature representation, and response maps-based knowledge distillation to guide the learning of the student Transformer network. Moreover, since existing event-based tracking datasets are all low-resolution ($346 \times 260$), we propose the first large-scale high-resolution ($1280 \times 720$) dataset named EventVOT. It contains 1141 videos and covers a wide range of categories such as pedestrians, vehicles, UAVs, ping pongs, etc. Extensive experiments on both low-resolution datasets (FE240hz, VisEvent, COESOT) and our newly proposed high-resolution EventVOT dataset fully validated the effectiveness of our proposed method. The dataset, evaluation toolkit, and source code are available at https://github.com/Event-AHU/EventVOT_Benchmark
CV · Jul 22, 2022
Few-Shot Class-Incremental Learning via Entropy-Regularized Data-Free Replay
Huan Liu, Li Gu, Zhixiang Chi et al.
Few-shot class-incremental learning (FSCIL) has been proposed to enable a deep learning system to incrementally learn new classes with limited data. Recently, a pioneering work claimed that the commonly used replay-based method in class-incremental learning (CIL) is ineffective and thus not preferred for FSCIL. This claim, if true, has a significant influence on the field of FSCIL. In this paper, we show through empirical results that adopting data replay is surprisingly favorable. However, storing and replaying old data can lead to a privacy concern. To address this issue, we alternatively propose using data-free replay that can synthesize data by a generator without accessing real data. Observing the effectiveness of uncertain data for knowledge distillation, we impose entropy regularization in the generator training to encourage more uncertain examples. Moreover, we propose to relabel the generated data with one-hot-like labels. This modification allows the network to learn by solely minimizing the cross-entropy loss, which mitigates the problem of balancing different objectives in the conventional knowledge distillation approach. Finally, we show extensive experimental results and analysis on CIFAR-100, miniImageNet and CUB-200 to demonstrate the effectiveness of our proposed method.
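The two quantities this abstract relies on, prediction entropy and one-hot-like relabeling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and relabeling by taking the arg-max class is an assumption, since the abstract does not say how the one-hot-like labels are formed.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a more uncertain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def relabel_one_hot(probs):
    """Assumed relabeling rule: put all mass on the arg-max class."""
    k = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(probs))]

# An uncertain (uniform) prediction has higher entropy than a confident one,
# so an entropy term in a generator objective would favor the former.
uncertain = softmax([0.0, 0.0, 0.0, 0.0])
confident = softmax([8.0, 0.0, 0.0, 0.0])
print(entropy(uncertain) > entropy(confident))
print(relabel_one_hot([0.1, 0.7, 0.2]))
```

With hard labels of this kind, training on the replayed samples needs only a cross-entropy term, which is the simplification the abstract describes.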
CV · Sep 22, 2022
Detecting Rotated Objects as Gaussian Distributions and Its 3-D Generalization
Xue Yang, Gefan Zhang, Xiaojiang Yang et al.
Existing detection methods commonly use a parameterized bounding box (BBox) to model and detect (horizontal) objects, and an additional rotation angle parameter is used for rotated objects. We argue that such a mechanism has fundamental limitations in building an effective regression loss for rotation detection, especially for high-precision detection with high IoU (e.g., 0.75). Instead, we propose to model the rotated objects as Gaussian distributions. A direct advantage is that our new regression loss, based on the distance between two Gaussians, e.g., the Kullback-Leibler Divergence (KLD), aligns well with the actual detection performance metric, which is not well addressed in existing methods. Moreover, the two bottlenecks, i.e., boundary discontinuity and the square-like problem, also disappear. We also propose an efficient Gaussian metric-based label assignment strategy to further boost the performance. Interestingly, by analyzing the BBox parameters' gradients under our Gaussian-based KLD loss, we show that these parameters are dynamically updated with interpretable physical meaning, which helps explain the effectiveness of our approach, especially for high-precision detection. We extend our approach from 2-D to 3-D with a tailored algorithm design to handle the heading estimation, and experimental results on twelve public datasets (2-D/3-D, aerial/text/face images) with various base detectors show its superiority.
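The Gaussian modeling idea can be made concrete with a short sketch. It assumes the commonly used conversion in which a rotated box $(x, y, w, h, \theta)$ maps to a Gaussian with mean $(x, y)$ and covariance $R\,\mathrm{diag}(w^2/4, h^2/4)\,R^\top$; the abstract does not state this mapping, so treat it and the function names as assumptions. The KLD itself is the standard closed form for two Gaussians, and the loss function the paper builds on top of it is not reproduced here.

```python
import math

def box_to_gaussian(cx, cy, w, h, theta):
    """Assumed mapping: mean (cx, cy), covariance R diag(w^2/4, h^2/4) R^T.

    The symmetric 2x2 covariance is returned as (sxx, sxy, syy).
    """
    c, s = math.cos(theta), math.sin(theta)
    a, b = w * w / 4.0, h * h / 4.0
    sxx = a * c * c + b * s * s
    sxy = (a - b) * c * s
    syy = a * s * s + b * c * c
    return (cx, cy), (sxx, sxy, syy)

def kld(p, q):
    """Closed-form KL(N_p || N_q) for two 2-D Gaussians."""
    (mpx, mpy), (pxx, pxy, pyy) = p
    (mqx, mqy), (qxx, qxy, qyy) = q
    det_p = pxx * pyy - pxy * pxy
    det_q = qxx * qyy - qxy * qxy
    # Inverse of the 2x2 covariance of q.
    ixx, ixy, iyy = qyy / det_q, -qxy / det_q, qxx / det_q
    # trace(Sigma_q^{-1} Sigma_p) for symmetric matrices.
    trace = ixx * pxx + 2.0 * ixy * pxy + iyy * pyy
    dx, dy = mqx - mpx, mqy - mpy
    # Mahalanobis term (mu_q - mu_p)^T Sigma_q^{-1} (mu_q - mu_p).
    maha = dx * (ixx * dx + ixy * dy) + dy * (ixy * dx + iyy * dy)
    return 0.5 * (trace + maha - 2.0 + math.log(det_q / det_p))

# The same physical box written two ways (width/height swapped, angle shifted
# by 90 degrees) yields the same Gaussian, so the divergence is ~0.
same_a = box_to_gaussian(0.0, 0.0, 4.0, 2.0, 0.0)
same_b = box_to_gaussian(0.0, 0.0, 2.0, 4.0, math.pi / 2)
print(abs(kld(same_a, same_b)) < 1e-9)
```

Under this mapping, equivalent parameterizations of one box collapse to a single distribution, which is one way to read the abstract's claim that boundary discontinuity disappears; a square box likewise gives the same Gaussian at any angle.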
69.0 · CV · Jun 1
Beyond Low-Rank: Low-Rank Sparse Prompting via Spiking Neural Network and Prompt Factorization
Yumiao Zhao, Bo Jiang, Beibei Wang et al.
Visual Prompting (VP) has emerged as an efficient paradigm for adapting large-scale pre-trained vision models to downstream tasks by incorporating learnable prompts at the input level. However, existing VP methods typically employ dense pixel-level prompts, which often suffer from redundant perturbations, limited generalization and energy inefficiency. To overcome these limitations, we propose to integrate brain-inspired spiking learning into visual prompt learning tasks. Spiking neurons can perform inexpensive information processing by converting the input data into discrete spike trains and returning sparse outputs. Inspired by this, we propose Low-Rank visual Spike Prompting (LoRSP), a novel framework that learns dynamic low-rank sparse visual prompts naturally via a spiking neuron learning mechanism. The core idea of LoRSP is to exploit the brain-inspired sparse firing mechanism of spiking neurons to generate a pixel-level sparse prompt for each instance. To be specific, we first construct a series of prompt factors via low-rank factorization to capture distinct prompt subspaces. These prompt factors are then fed into an SNN architecture, which performs the integrate-and-fire process to emit spikes. As a result, our LoRSP generates a sparse visual prompt while maintaining the low-rank constraint. This design enables instance-specific selective prompting, leading to more compact and robust adaptation across diverse downstream tasks. Extensive experiments on five heterogeneous vision backbones and multiple benchmarks demonstrate that LoRSP achieves competitive performance while requiring fewer tunable parameters compared to existing VP methods.
CV · Aug 1, 2022
Multi-spectral Vehicle Re-identification with Cross-directional Consistency Network and a High-quality Benchmark
Aihua Zheng, Xianpeng Zhu, Zhiqi Ma et al.
To tackle the challenge of vehicle re-identification (Re-ID) in complex lighting environments and diverse scenes, multi-spectral sources like visible and infrared information are taken into consideration due to their excellent complementary advantages. However, multi-spectral vehicle Re-ID suffers from cross-modality discrepancy caused by the heterogeneous properties of different modalities, as well as from the diverse appearance of each identity across views. Meanwhile, diverse environmental interference leads to heavy sample distributional discrepancy in each modality. In this work, we propose a novel cross-directional consistency network to simultaneously overcome the discrepancies from both modality and sample aspects. In particular, we design a new cross-directional center loss that pulls the modality centers of each identity close to mitigate the cross-modality discrepancy, and pulls the sample centers of each identity close to alleviate the sample discrepancy. Such a strategy can generate discriminative multi-spectral feature representations for vehicle Re-ID. In addition, we design an adaptive layer normalization unit to dynamically adjust individual feature distributions to handle the distributional discrepancy of intra-modality features for robust learning. To provide a comprehensive evaluation platform, we create a high-quality RGB-NIR-TIR multi-spectral vehicle Re-ID benchmark (MSVR310), including 310 different vehicles from a broad range of viewpoints, time spans and environmental complexities. Comprehensive experiments on both the created and public datasets demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods.
CV · Aug 2, 2024 · Code
Visible-Thermal Multiple Object Tracking: Large-scale Video Dataset and Progressive Fusion Approach
Yabin Zhu, Qianwu Wang, Chenglong Li et al.
The complementary benefits of visible and thermal infrared data are widely utilized in various computer vision tasks, such as visual tracking, semantic segmentation and object detection, but rarely explored in Multiple Object Tracking (MOT). In this work, we contribute a large-scale Visible-Thermal video benchmark for MOT, called VT-MOT. VT-MOT has the following main advantages. 1) The data is large in scale and highly diverse. VT-MOT includes 582 video sequence pairs and 401k frame pairs from surveillance, drone, and handheld platforms. 2) The cross-modal alignment is highly accurate. We invite several professionals to perform both spatial and temporal alignment frame by frame. 3) The annotation is dense and high-quality. VT-MOT has 3.99 million annotation boxes annotated and double-checked by professionals, including heavy occlusion and object re-acquisition (objects disappearing and reappearing) challenges. To provide a strong baseline, we design a simple yet effective tracking framework, which effectively fuses temporal information and complementary information of the two modalities in a progressive manner, for robust visible-thermal MOT. Comprehensive experiments are conducted on VT-MOT, and the results demonstrate the superiority and effectiveness of the proposed method compared with state-of-the-art methods. From the evaluation results and analysis, we specify several potential future directions for visible-thermal MOT. The project is released at https://github.com/wqw123wqw/PFTrack.
CV · Jul 15, 2024 · Code
An Empirical Study of Mamba-based Pedestrian Attribute Recognition
Xiao Wang, Weizhe Kong, Jiandong Jin et al.
Current strong pedestrian attribute recognition (PAR) models are developed based on Transformer networks, which are computationally heavy. Recently proposed models with linear complexity (e.g., Mamba) have garnered significant attention and have achieved a good balance between accuracy and computational cost across a variety of visual tasks. Relevant review articles also suggest that while these models can perform well on some pedestrian attribute recognition datasets, they are generally weaker than the corresponding Transformer models. To further tap into the potential of the novel Mamba architecture for PAR tasks, this paper designs and adapts Mamba into two typical PAR frameworks, i.e., the text-image fusion approach and the pure vision Mamba multi-label recognition framework. It is found that interacting with attribute tags as additional input does not always lead to an improvement: specifically, Vim can be enhanced, but VMamba cannot. This paper further designs various hybrid Mamba-Transformer variants and conducts thorough experimental validations. These experimental results indicate that simply enhancing Mamba with a Transformer does not always lead to performance improvements but yields better results under certain settings. We hope this empirical study can further inspire research in Mamba for PAR, and even extend into the domain of multi-label recognition, through the design of these network structures and comprehensive experimentation. The source code of this work will be released at https://github.com/Event-AHU/OpenPAR
CV · May 19, 2022
Beyond Greedy Search: Tracking by Multi-Agent Reinforcement Learning-based Beam Search
Xiao Wang, Zhe Chen, Bo Jiang et al.
To track the target in a video, current visual trackers usually adopt greedy search for target object localization in each frame; that is, the candidate region with the maximum response score is selected as the tracking result of each frame. However, we found that this may not be an optimal choice, especially when encountering challenging tracking scenarios such as heavy occlusion and fast motion. To address this issue, we propose to maintain multiple tracking trajectories and apply a beam search strategy for visual tracking, so that the trajectory with fewer accumulated errors can be identified. Accordingly, this paper introduces a novel multi-agent reinforcement learning based beam search tracking strategy, termed BeamTracking. It is mainly inspired by the image captioning task, which takes an image as input and generates diverse descriptions using a beam search algorithm. Accordingly, we formulate tracking as a sample selection problem fulfilled by multiple parallel decision-making processes, each of which aims at picking out one sample as its tracking result in each frame. Each maintained trajectory is associated with an agent that performs the decision-making and determines what actions should be taken to update related information. When all the frames are processed, we select the trajectory with the maximum accumulated score as the tracking result. Extensive experiments on seven popular tracking benchmark datasets validated the effectiveness of the proposed algorithm.
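The contrast between greedy search and beam search over trajectories can be sketched in a few lines. This is plain beam search, not the paper's multi-agent reinforcement learning formulation: the candidate format (a 1-D position with a response score), the `motion_penalty` term, and the function names are all assumptions made for illustration.

```python
def greedy_track(frames):
    """Pick the highest-response candidate independently in every frame."""
    return [max(cands, key=lambda pr: pr[1])[0] for cands in frames]

def beam_search_track(frames, beam_width=2, motion_penalty=1.0):
    """Keep the beam_width best trajectories by accumulated score.

    frames: one list per frame of (position, response) candidates.
    The score adds each response and subtracts a hypothetical penalty
    proportional to the jump from the previous position.
    """
    beams = [(0.0, [])]  # (accumulated score, positions so far)
    for cands in frames:
        expanded = []
        for score, path in beams:
            for pos, resp in cands:
                jump = abs(pos - path[-1]) if path else 0.0
                new_score = score + resp - motion_penalty * jump
                expanded.append((new_score, path + [pos]))
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0]  # best (score, trajectory) after the last frame

# In the middle frame a distractor at position 5.0 briefly scores higher.
frames = [
    [(0.0, 0.9), (5.0, 0.8)],
    [(0.0, 0.5), (5.0, 0.6)],
    [(0.0, 0.9), (5.0, 0.2)],
]
print(greedy_track(frames))
print(beam_search_track(frames)[1])
```

In this toy input the greedy tracker jumps to the distractor in the middle frame, while the beam keeps the smoother trajectory alive and selects it once all frames are scored, mirroring the abstract's final step of choosing the trajectory with the maximum accumulated score.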
62.9 · CV · Mar 11 · Code
UniPINN: A Unified PINN Framework for Multi-task Learning of Diverse Navier-Stokes Equations
Dengdi Sun, Jie Chen, Xiao Wang et al.
Physics-Informed Neural Networks (PINNs) have shown promise in solving incompressible Navier-Stokes equations, yet existing approaches are predominantly designed for single-flow settings. When extended to multi-flow scenarios, these methods face three key challenges: (1) difficulty in simultaneously capturing both shared physical principles and flow-specific characteristics, (2) susceptibility to inter-task negative transfer that degrades prediction accuracy, and (3) unstable training dynamics caused by disparate loss magnitudes across heterogeneous flow regimes. To address these limitations, we propose UniPINN, a unified multi-flow PINN framework that integrates three complementary components: a shared-specialized architecture that disentangles universal physical laws from flow-specific features, a cross-flow attention mechanism that selectively reinforces relevant patterns while suppressing task-irrelevant interference, and a dynamic weight allocation strategy that adaptively balances loss contributions to stabilize multi-objective optimization. Extensive experiments on three canonical flows demonstrate that UniPINN effectively unifies multi-flow learning, achieving superior prediction accuracy and balanced performance across heterogeneous regimes while successfully mitigating negative transfer. The source code of this paper will be released on https://github.com/Event-AHU/OpenFusion
CV · Mar 26, 2023
RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning
Yabin Zhu, Chenglong Li, Xiao Wang et al.
Existing Transformer-based RGBT tracking methods either use cross-attention to fuse the two modalities, or use self-attention and cross-attention to model both modality-specific and modality-sharing information. However, the significant appearance gap between modalities limits the feature representation ability of certain modalities during the fusion process. To address this problem, we propose a novel Progressive Fusion Transformer called ProFormer, which progressively integrates single-modality information into the multimodal representation for robust RGBT tracking. In particular, ProFormer first uses a self-attention module to collaboratively extract the multimodal representation, and then uses two cross-attention modules to make it interact with the features of each of the two modalities. In this way, the modality-specific information can be well activated in the multimodal representation. Finally, a feed-forward network is used to fuse the two interacted multimodal representations to further enhance the final multimodal representation. In addition, existing learning methods of RGBT trackers either fuse multimodal features into one for final classification, or exploit the relationship between the unimodal branches and the fused branch through a competitive learning strategy. However, they either ignore the learning of single-modality branches or result in one branch failing to be well optimized. To solve these problems, we propose a dynamically guided learning algorithm that adaptively uses well-performing branches to guide the learning of other branches, enhancing the representation ability of each branch. Extensive experiments demonstrate that our proposed ProFormer sets a new state-of-the-art performance on the RGBT210, RGBT234, LasHeR, and VTUAV datasets.
CV · Jul 25, 2022
Error-Aware Spatial Ensembles for Video Frame Interpolation
Zhixiang Chi, Rasoul Mohammadi Nasiri, Zheng Liu et al.
Video frame interpolation (VFI) algorithms have improved considerably in recent years due to unprecedented progress in both data-driven algorithms and their implementations. Recent research has introduced advanced motion estimation or novel warping methods as the means to address challenging VFI scenarios. However, none of the published VFI works considers the spatially non-uniform characteristics of the interpolation error (IE). This work introduces such a solution. By closely examining the correlation between optical flow and IE, the paper proposes novel error prediction metrics that partition the middle frame into distinct regions corresponding to different IE levels. Building upon this IE-driven segmentation, and through the use of novel error-controlled loss functions, it introduces an ensemble of spatially adaptive interpolation units that progressively processes and integrates the segmented regions. This spatial ensemble results in an effective and computationally attractive VFI solution. Extensive experimentation on popular video interpolation benchmarks indicates that the proposed solution outperforms the current state-of-the-art (SOTA) in applications of current interest.
CV · Jun 8, 2023
Point-Voxel Absorbing Graph Representation Learning for Event Stream based Recognition
Bo Jiang, Chengguo Yuan, Xiao Wang et al.
Sampled point and voxel methods are usually employed to downsample the dense events into sparse ones. After that, one popular way is to leverage a graph model which treats the sparse points/voxels as nodes and adopts graph neural networks (GNNs) to learn the representation of event data. Although good performance can be obtained, their results are still limited, mainly due to two issues. (1) Existing event GNNs generally adopt an additional max (or mean) pooling layer to summarize all node embeddings into a single graph-level representation for the whole event data representation. However, this approach fails to capture the importance of graph nodes and also fails to be fully aware of the node representations. (2) Existing methods generally employ either a sparse point or a voxel graph representation model, and thus do not consider the complementarity between these two types of representation models. To address these issues, we propose a novel dual point-voxel absorbing graph representation learning for event stream data representation. To be specific, given the input event stream, we first transform it into the sparse event cloud and voxel grids and build dual absorbing graph models for them respectively. Then, we design a novel absorbing graph convolutional network (AGCN) for our dual absorbing graph representation and learning. The key aspect of the proposed AGCN is its ability to effectively capture the importance of nodes and thus be fully aware of node representations in summarizing all node representations through the introduced absorbing nodes. Extensive experiments on multiple event-based classification benchmark datasets fully validated the effectiveness of our framework.
CV · Jun 2, 2022
Disentangled Generation Network for Enlarged License Plate Recognition and A Unified Dataset
Chenglong Li, Xiaobin Yang, Guohao Wang et al.
License plate recognition plays a critical role in many practical applications, but license plates of large vehicles are difficult to recognize due to factors such as low resolution, contamination, low illumination, and occlusion, to name a few. To overcome these factors, the transportation management department generally introduces an enlarged license plate at the rear of a vehicle. However, enlarged license plates have high diversity as they are non-standard in position, size, and style. Furthermore, the background regions contain a variety of noisy information which greatly disturbs the recognition of license plate characters. Existing works have not studied this challenging problem. In this work, we first address the enlarged license plate recognition problem and contribute a dataset containing 9342 images, which cover most of the challenges of real scenes. However, the created data are still insufficient to train deep methods of enlarged license plate recognition, and building large-scale training data is time-consuming and labor-intensive. To handle this problem, we propose a novel task-level disentanglement generation framework based on the Disentangled Generation Network (DGNet), which disentangles the generation into text generation and background generation in an end-to-end manner to effectively ensure diversity and integrity, for robust enlarged license plate recognition. Extensive experiments on the created dataset are conducted, and we demonstrate the effectiveness of the proposed approach in three representative text recognition frameworks.
79.1 · CV · Mar 11 · Code
UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark
Yu Zhang, Zhicheng Zhao, Ze Luo et al.
Traffic scene understanding from unmanned aerial vehicle (UAV) platforms is crucial for intelligent transportation systems due to its flexible deployment and wide-area monitoring capabilities. However, existing methods face significant challenges in real-world surveillance, as their heavy reliance on optical imagery leads to severe performance degradation under adverse illumination conditions like nighttime and fog. Furthermore, current Visual Question Answering (VQA) models are restricted to elementary perception tasks, lacking the domain-specific regulatory knowledge required to assess complex traffic behaviors. To address these limitations, we propose a novel Cross-spectral Traffic Cognition Network (CTCNet) for robust UAV traffic scene understanding. Specifically, we design a Prototype-Guided Knowledge Embedding (PGKE) module that leverages high-level semantic prototypes from an external Traffic Regulation Memory (TRM) to anchor domain-specific knowledge into visual representations, enabling the model to comprehend complex behaviors and distinguish fine-grained traffic violations. Moreover, we develop a Quality-Aware Spectral Compensation (QASC) module that exploits the complementary characteristics of optical and thermal modalities to perform bidirectional context exchange, effectively compensating for degraded features to ensure robust representation in complex environments. In addition, we construct Traffic-VQA, the first large-scale optical-thermal infrared benchmark for cognitive UAV traffic understanding, comprising 8,180 aligned image pairs and 1.3 million question-answer pairs across 31 diverse types. Extensive experiments demonstrate that CTCNet significantly outperforms state-of-the-art methods in both cognition and perception scenarios. The dataset is available at https://github.com/YuZhang-2004/UAV-traffic-scene-understanding.
62.9 · CV · May 11
NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification
Shihao Li, Huaibo Huang, Junxian Duan et al.
Multi-modal object Re-IDentification (ReID) aims to obtain complete identity features across heterogeneous modalities. However, most existing methods rely on implicit feature fusion modules, making it difficult to model fine-grained recognition patterns under various challenges in the real world. Benefiting from powerful Multi-modal Large Language Models (MLLMs), object appearances can be effectively translated into descriptive captions. In this paper, we propose a reliable caption generation pipeline based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs and improves the quality of generated text. Additionally, to model diverse identity patterns, we propose a novel ReID framework, named NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural branches to separately capture fine-grained appearance features and coarse-grained structure features. For semantic recognition, we first propose Text-Modulated Semantic Experts (TMSE), which randomly sample high-quality captions to modulate experts capturing semantic features and mining inter-modality complementary cues. Second, to recognize structure features, we propose Context-Shared Structure Experts (CSSE), which focus on the holistic object structure and maintain identity structural consistency via a soft routing mechanism. Finally, we propose Multi-Grained Features Aggregation (MGFA), which adopts a unified fusion strategy to effectively integrate multi-grained expert features into the final identity representations. Extensive experiments on two public person datasets and three vehicle datasets demonstrate the effectiveness of our method, showing that it significantly outperforms existing state-of-the-art methods.
CV · Aug 23, 2024 · Code
VFM-Det: Towards High-Performance Vehicle Detection via Large Foundation Models
Wentao Wu, Fanghua Hong, Xiao Wang et al.
Existing vehicle detectors are usually obtained by training a typical detector (e.g., YOLO, RCNN, DETR series) on vehicle images based on a pre-trained backbone (e.g., ResNet, ViT). Some researchers also exploit pre-trained large foundation models to enhance the detection performance. However, we think these detectors may only get sub-optimal results because the large models they use are not specifically designed for vehicles. In addition, their results heavily rely on visual features, and few of them consider the alignment between the vehicle's semantic information and visual representations. In this work, we propose a new vehicle detection paradigm based on a pre-trained foundation vehicle model (VehicleMAE) and a large language model (T5), termed VFM-Det. It follows the region proposal-based detection framework, and the features of each proposal can be enhanced using VehicleMAE. More importantly, we propose a new VAtt2Vec module that predicts the vehicle semantic attributes of these proposals and transforms them into feature vectors to enhance the vision features via contrastive learning. Extensive experiments on three vehicle detection benchmark datasets thoroughly validated the effectiveness of our vehicle detector. Specifically, our model improves the baseline approach by $+5.1\%$, $+6.2\%$ on the $AP_{0.5}$, $AP_{0.75}$ metrics, respectively, on the Cityscapes dataset. The source code of this work will be released at https://github.com/Event-AHU/VFM-Det.
CVJan 16Code
FTDMamba: Frequency-Assisted Temporal Dilation Mamba for Unmanned Aerial Vehicle Video Anomaly DetectionCheng-Zhuang Liu, Si-Bao Chen, Qing-Ling Shu et al.
Recent advances in video anomaly detection (VAD) mainly focus on ground-based surveillance or unmanned aerial vehicle (UAV) videos with static backgrounds, whereas research on UAV videos with dynamic backgrounds remains limited. Unlike static scenarios, dynamically captured UAV videos exhibit multi-source motion coupling, where the motion of objects and UAV-induced global motion are intricately intertwined. Consequently, existing methods may misclassify normal UAV movements as anomalies or fail to capture true anomalies concealed within dynamic backgrounds. Moreover, many approaches do not adequately address the joint modeling of inter-frame continuity and local spatial correlations across diverse temporal scales. To overcome these limitations, we propose the Frequency-Assisted Temporal Dilation Mamba (FTDMamba) network for UAV VAD, including two core components: (1) a Frequency Decoupled Spatiotemporal Correlation Module, which disentangles coupled motion patterns and models global spatiotemporal dependencies through frequency analysis; and (2) a Temporal Dilation Mamba Module, which leverages Mamba's sequence modeling capability to jointly learn fine-grained temporal dynamics and local spatial structures across multiple temporal receptive fields. Additionally, unlike existing UAV VAD datasets which focus on static backgrounds, we construct a large-scale Moving UAV VAD dataset (MUVAD), comprising 222,736 frames with 240 anomaly events across 12 anomaly types. Extensive experiments demonstrate that FTDMamba achieves state-of-the-art (SOTA) performance on two public static benchmarks and the new MUVAD dataset. The code and MUVAD dataset will be available at: https://github.com/uavano/FTDMamba.
CVSep 25, 2022
Hand Hygiene Assessment via Joint Step Segmentation and Key Action ScorerChenglong Li, Qiwen Zhu, Tubiao Liu et al.
Hand hygiene is a standard six-step hand-washing action proposed by the World Health Organization (WHO). However, there is no good way to supervise how medical staff perform hand hygiene, which brings the potential risk of disease spread. Existing action assessment works usually make an overall quality prediction on an entire video. However, the internal structures of the hand hygiene action are important in hand hygiene assessment. Therefore, we propose a novel fine-grained learning framework that performs step segmentation and key action scoring in a joint manner for accurate hand hygiene assessment. Existing temporal segmentation methods usually employ a multi-stage convolutional network to improve segmentation robustness, but easily lead to over-segmentation due to the lack of long-range dependence. To address this issue, we design a multi-stage convolution-transformer network for step segmentation. Based on the observation that each hand-washing step involves several key actions which determine the hand-washing quality, we design a set of key action scorers to evaluate the quality of key actions in each step. In addition, a unified dataset for hand hygiene assessment is lacking. Therefore, under the supervision of medical staff, we contribute a video dataset that contains 300 video sequences with fine-grained annotations. Extensive experiments on the dataset suggest that our method assesses hand hygiene videos well and achieves outstanding performance.
SDSep 10, 2024Code
MTDA-HSED: Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event DetectionZehao Wang, Haobo Yue, Zhicheng Zhang et al.
Sound Event Detection (SED) plays a vital role in comprehending and perceiving acoustic scenes. Previous methods have demonstrated impressive capabilities. However, they are deficient in learning features of complex scenes from heterogeneous datasets. In this paper, we introduce a novel dual-branch architecture named Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection (MTDA-HSED). The MTDA-HSED architecture employs the Mutual-Assistance Audio Adapter (M3A) to effectively tackle the multi-scenario problem and uses the Dual-Branch Mid-Fusion (DBMF) module to tackle the multi-granularity problem. Specifically, M3A is integrated into the BEATs block as an adapter to improve the BEATs' performance by fine-tuning it on the multi-scenario dataset. The DBMF module connects BEATs and CNN branches, which facilitates the deep fusion of information from the BEATs and the CNN branches. Experimental results show that the proposed methods exceed the baseline of mpAUC by $5\%$ on the DESED and MAESTRO Real datasets. Code is available at https://github.com/Visitor-W/MTDA.
CVAug 16, 2024
RGBT Tracking via All-layer Multimodal Interactions with Progressive Fusion MambaAndong Lu, Wanyu Wang, Chenglong Li et al.
Existing RGBT tracking methods often design various interaction models to perform cross-modal fusion of each layer, but cannot execute the feature interactions among all layers, which play a critical role in robust multimodal representation, due to the large computational burden. To address this issue, this paper presents a novel All-layer multimodal Interaction Network, named AINet, which performs efficient and effective feature interactions of all modalities and layers in a progressive fusion Mamba, for robust RGBT tracking. Even though modality features in different layers are known to contain different cues, it is always challenging to build multimodal interactions in each layer because of the difficulty of balancing interaction capability and efficiency. Meanwhile, considering that the feature discrepancy between RGB and thermal modalities reflects their complementary information to some extent, we design a Difference-based Fusion Mamba (DFM) to achieve enhanced fusion of different modalities with linear complexity. When interacting with features from all layers, a huge number of token sequences (3840 tokens in this work) are involved and the computational burden is thus large. To handle this problem, we design an Order-dynamic Fusion Mamba (OFM) to execute efficient and effective feature interactions of all layers by dynamically adjusting the scan order of different layers in Mamba. Extensive experiments on four public RGBT tracking datasets show that AINet achieves leading performance against existing state-of-the-art methods.
LGApr 15, 2024Code
State Space Model for New-Generation Network Alternative to Transformers: A SurveyXiao Wang, Shiao Wang, Yuhe Ding et al.
In the post-deep learning era, the Transformer architecture has demonstrated its powerful performance across pre-trained big models and various downstream tasks. However, the enormous computational demands of this architecture have deterred many researchers. To further reduce the complexity of attention models, numerous efforts have been made to design more efficient methods. Among them, the State Space Model (SSM), as a possible replacement for the self-attention based Transformer model, has drawn more and more attention in recent years. In this paper, we give the first comprehensive review of these works and also provide experimental comparisons and analysis to better demonstrate the features and advantages of SSM. Specifically, we first give a detailed description of principles to help the readers quickly capture the key ideas of SSM. After that, we dive into the reviews of existing SSMs and their various applications, including natural language processing, computer vision, graph, multi-modal and multi-media, point cloud/event stream, time series data, and other domains. In addition, we give statistical comparisons and analysis of these models and hope it helps the readers to understand the effectiveness of different structures on various tasks. Then, we propose possible research points in this direction to better promote the development of the theoretical model and application of SSM. More related works will be continuously updated on the following GitHub: https://github.com/Event-AHU/Mamba_State_Space_Model_Paper_List.
LGJan 1
Robust Graph Fine-Tuning with Adversarial Graph PromptingZiyan Zhang, Bo Jiang, Jin Tang
Parameter-Efficient Fine-Tuning (PEFT) has emerged as a dominant paradigm for adapting pre-trained GNN models to downstream tasks. However, existing PEFT methods usually exhibit significant vulnerability to various noise and attacks on graph topology and node attributes/features. To address this issue, for the first time, we propose integrating adversarial learning into graph prompting and develop a novel Adversarial Graph Prompting (AGP) framework to achieve robust graph fine-tuning. Our AGP has two key aspects. First, we propose the general problem formulation of AGP as a min-max optimization problem and develop an alternating optimization scheme to solve it. For inner maximization, we propose a Joint Projected Gradient Descent (JointPGD) algorithm to generate strong adversarial noise. For outer minimization, we employ a simple yet effective module to learn the optimal node prompts to counteract the adversarial noise. Second, we demonstrate that the proposed AGP can theoretically address both graph topology and node noise. This confirms the versatility and robustness of our AGP fine-tuning method across various graph noise. Note that the proposed AGP is a general method that can be integrated with various pre-trained GNN models to enhance their robustness on the downstream tasks. Extensive experiments on multiple benchmark tasks validate the robustness and effectiveness of the AGP method compared to state-of-the-art methods.
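As a rough illustration of the min-max formulation mentioned in the abstract above, one generic way to write such an objective is sketched below. The notation is assumed here and is not taken from the paper: $f_{\theta}$ is a frozen pre-trained GNN, $A$ and $X$ are the adjacency matrix and node features, $\Delta_A$ and $\Delta_X$ are bounded adversarial perturbations drawn from a budget set $\mathcal{S}$, $P$ is the learnable node prompt (assumed additive on the node features), and $Y$ holds the downstream labels.

```latex
\min_{P} \; \max_{(\Delta_A,\, \Delta_X) \in \mathcal{S}} \;
\mathcal{L}\bigl( f_{\theta}(A + \Delta_A,\; X + \Delta_X + P),\; Y \bigr)
```

Under this reading, an alternating scheme would take projected gradient ascent steps on $(\Delta_A, \Delta_X)$ with $P$ fixed, then gradient descent steps on $P$ with the perturbations fixed.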
91.8CVMay 7Code
T2I-VeRW: Part-level Fine-grained Perception for Text-to-Image Vehicle RetrievalXiao Wang, Ziwen Wang, Weizhe Kong et al.
Vehicle Re-identification (Re-ID) aims to retrieve the most similar image to a given query from images captured by non-overlapping cameras. Extending vehicle Re-ID from image-only queries to text-based queries enables retrieval in real-world scenarios where only a witness description of the target vehicle is available. In this paper, we propose PFCVR, a Part-level Fine-grained Cross-modal Vehicle Retrieval model for text-to-image vehicle re-identification. PFCVR constructs locally paired images and texts at the part level and introduces learnable part-query tokens that aggregate both part-specific and full-sentence context before aligning with visual part features. On top of this explicit local alignment, a bi-directional mask recovery module lets each modality reconstruct its masked content under the guidance of the other, implicitly bridging local correspondences into global feature alignment. Furthermore, we construct a new large-scale dataset called T2I-VeRW, which contains 14,668 images covering 1,796 vehicle identities with fine-grained part-level annotations. Experimental results on the T2I-VeRI dataset show that PFCVR achieves 29.2% Rank-1 accuracy, improving over the best competing method by 3.7 percentage points. On the newly proposed T2I-VeRW benchmark, PFCVR achieves 55.2% Rank-1 accuracy, outperforming a comprehensive set of recent state-of-the-art methods. Source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID
CVMay 8, 2022
Past and Future Motion Guided Network for Audio Visual Event LocalizationTingxiu Chen, Jianqin Yin, Jin Tang
In recent years, audio-visual event localization has attracted much attention. Its purpose is to detect the segment containing audio-visual events and recognize the event category from untrimmed videos. Existing methods use audio-guided visual attention to lead the model to pay attention to the spatial area of the ongoing event, focusing on the correlation between audio and visual information but ignoring the correlation between audio and spatial motion. We propose a past and future motion extraction (pf-ME) module to mine the visual motion from videos, embedded into the past and future motion guided network (PFAGN), and a motion guided audio attention (MGAA) module to focus on the information related to events of interest in the audio modality through the past and future visual motion. We choose AVE as the experimental verification dataset, and the experiments show that our method outperforms the state of the art in both supervised and weakly-supervised settings.
CVDec 17, 2023Code
Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language FusionXiao Wang, Jiandong Jin, Chenglong Li et al.
Existing pedestrian attribute recognition (PAR) algorithms adopt a pre-trained CNN (e.g., ResNet) as their backbone network for visual feature learning, which might obtain sub-optimal results due to the insufficient employment of the relations between pedestrian images and attribute labels. In this paper, we formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels. Specifically, the attribute phrases are first expanded into sentences, and then the pre-trained vision-language model CLIP is adopted as our backbone for feature embedding of visual images and attribute descriptions. The contrastive learning objective connects the vision and language modalities well in the CLIP-based feature space, and the Transformer layers used in CLIP can capture the long-range relations between pixels. Then, a multi-modal Transformer is adopted to fuse the dual features effectively and a feed-forward network is used to predict attributes. To optimize our network efficiently, we propose the region-aware prompt tuning technique to adjust very few parameters (i.e., only the prompt vectors and classification heads) and fix both the pre-trained VL model and multi-modal Transformer. Our proposed PAR algorithm adjusts only 0.75% of the learnable parameters compared with the fine-tuning strategy. It also achieves new state-of-the-art performance on both standard and zero-shot settings for PAR, including the RAPv1, RAPv2, WIDER, PA100K, PETA-ZS, and RAP-ZS datasets. The source code and pre-trained models will be released on https://github.com/Event-AHU/OpenPAR.
CVDec 22, 2023Code
A Multi-Stage Adaptive Feature Fusion Neural Network for Multimodal Gait RecognitionShinan Zou, Jianbo Xiong, Chao Fan et al.
Gait recognition is a biometric technology that has received extensive attention. Most existing gait recognition algorithms are unimodal, and a few multimodal gait recognition algorithms perform multimodal fusion only once. These algorithms may not fully exploit the complementary advantages of the multiple modalities. In this paper, by considering the temporal and spatial characteristics of gait data, we propose a multi-stage feature fusion strategy (MSFFS), which performs multimodal fusions at different stages in the feature extraction process. Also, we propose an adaptive feature fusion module (AFFM) that considers the semantic association between silhouettes and skeletons. The fusion process fuses different silhouette areas with their more related skeleton joints. Since visual appearance changes and time passage co-occur in a gait period, we propose a multiscale spatial-temporal feature extractor (MSSTFE) to learn the spatial-temporal linkage features thoroughly. Specifically, MSSTFE extracts and aggregates spatial-temporal linkage information at different spatial scales. Combining the strategy and modules mentioned above, we propose a multi-stage adaptive feature fusion (MSAFF) neural network, which shows state-of-the-art performance in many experiments on three datasets. Besides, MSAFF is equipped with feature dimensional pooling (FD Pooling), which can significantly reduce the dimension of the gait representations without hindering the accuracy. https://github.com/ShinanZou/MSAFF
CVDec 31, 2022
An end-to-end multi-scale network for action prediction in videosXiaofa Liu, Jianqin Yin, Yuan Sun et al.
In this paper, we develop an efficient multi-scale network to predict action classes in partial videos in an end-to-end manner. Unlike most existing methods with offline feature generation, our method directly takes frames as input and further models motion evolution on two different temporal scales. Therefore, we solve the complexity problems of the two stages of modeling and the problem of insufficient temporal and spatial information of a single scale. Our proposed End-to-End MultiScale Network (E2EMSNet) is composed of two scales, which are named segment scale and observed global scale. The segment scale leverages temporal difference over consecutive frames for finer motion patterns by applying 2D convolutions. For observed global scale, a Long Short-Term Memory (LSTM) is incorporated to capture motion features of observed frames. Our model provides a simple and efficient modeling framework with a small computational cost. Our E2EMSNet is evaluated on three challenging datasets: BIT, HMDB51, and UCF101. The extensive experiments demonstrate the effectiveness of our method for action prediction in videos.
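The temporal difference over consecutive frames mentioned in the abstract above can be illustrated with a minimal sketch. This is not the E2EMSNet implementation: the function name `temporal_difference` and the toy 2x2 single-channel frames are hypothetical, and the sketch only shows the differencing step that would precede any 2D convolution.

```python
def temporal_difference(frames):
    """Return element-wise differences between each pair of consecutive frames.

    `frames` is a list of equally sized 2D lists (toy single-channel images).
    The result has one fewer entry than the input.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diff = [
            [c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ]
        diffs.append(diff)
    return diffs


# Three hypothetical 2x2 frames; only changed pixels give non-zero differences.
frames = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 2]],
    [[1, 3], [0, 2]],
]
diffs = temporal_difference(frames)
print(diffs)
```

Static pixels cancel to zero in the difference maps, so a later layer sees mainly the motion between frames rather than the appearance.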
CVDec 22, 2023Code
Cross-Covariate Gait Recognition: A BenchmarkShinan Zou, Chao Fan, Jianbo Xiong et al.
Gait datasets are essential for gait research. However, this paper observes that present benchmarks, whether conventional constrained or emerging real-world datasets, fall short regarding covariate diversity. To bridge this gap, we undertake an arduous 20-month effort to collect a cross-covariate gait recognition (CCGR) dataset. The CCGR dataset has 970 subjects and about 1.6 million sequences; almost every subject has 33 views and 53 different covariates. Compared to existing datasets, CCGR has both population and individual-level diversity. In addition, the views and covariates are well labeled, enabling the analysis of the effects of different factors. CCGR provides multiple types of gait data, including RGB, parsing, silhouette, and pose, offering researchers a comprehensive resource for exploration. In order to delve deeper into addressing cross-covariate gait recognition, we propose parsing-based gait recognition (ParsingGait) by utilizing the newly proposed parsing data. We have conducted extensive experiments. Our main results show: 1) Cross-covariate emerges as a pivotal challenge for practical applications of gait recognition. 2) ParsingGait demonstrates remarkable potential for further advancement. 3) Alarmingly, existing SOTA methods achieve less than 43% accuracy on the CCGR, highlighting the urgency of exploring cross-covariate gait recognition. Link: https://github.com/ShinanZou/CCGR.
CVNov 26, 2025Code
SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change CaptioningFutian Wang, Mengqi Wang, Xiao Wang et al.
Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released on https://github.com/Event-AHU/SAM_ChangeCaptioning
CVNov 26, 2025Code
EvRainDrop: HyperGraph-guided Completion for Effective Frame and Event Stream AggregationFutian Wang, Fan Zhang, Xiao Wang et al.
Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically use event frames, voxels, or tensors as input. Although these approaches have achieved notable progress, they struggle to address the undersampling problem caused by spatial sparsity. In this paper, we propose a novel hypergraph-guided spatio-temporal event stream completion mechanism, which connects event tokens across different times and spatial locations via hypergraphs and leverages contextual information message passing to complete these sparse events. The proposed method can flexibly incorporate RGB tokens as nodes in the hypergraph within this completion framework, enabling multi-modal hypergraph-based information completion. Subsequently, we aggregate hypergraph node information across different time steps through self-attention, enabling effective learning and fusion of multi-modal features. Extensive experiments on both single- and multi-label event classification tasks fully validated the effectiveness of our proposed framework. The source code of this paper will be released on https://github.com/Event-AHU/EvRainDrop.
SDAug 9, 2024
SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance EstimationDa Mu, Zhicheng Zhang, Haobo Yue et al.
In the Sound Event Localization and Detection (SELD) task, Transformer-based models have demonstrated impressive capabilities. However, the quadratic complexity of the Transformer's self-attention mechanism results in computational inefficiencies. In this paper, we propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model. We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks to capture a broader range of contextual information while maintaining computational efficiency. Additionally, we implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss. Our experimental results on the 2024 DCASE Challenge Task3 dataset demonstrate the effectiveness of the selective state-space model in SELD and highlight the benefits of the two-stage training approach in enhancing SELD performance.
CVDec 25, 2023Code
Modality-missing RGBT Tracking: Invertible Prompt Learning and High-quality BenchmarksAndong Lu, Jiacong Zhao, Chenglong Li et al.
Current RGBT tracking research relies on complete multi-modal input, but modal information might be missing due to factors such as thermal sensor self-calibration and data transmission errors, which we call the modality-missing challenge in this work. To address this challenge, we propose a novel invertible prompt learning approach, which integrates the content-preserving prompts into a well-trained tracking model to adapt to various modality-missing scenarios, for robust RGBT tracking. Given one modality-missing scenario, we propose to utilize the available modality to generate the prompt of the missing modality to adapt the RGBT tracking model. However, the cross-modality gap between available and missing modalities usually causes semantic distortion and information loss in prompt generation. To handle this issue, we design the invertible prompter by incorporating the full reconstruction of the input available modality from the generated prompt. To provide a comprehensive evaluation platform, we construct several high-quality benchmark datasets, in which various modality-missing scenarios are considered to simulate real-world challenges. Extensive experiments on three modality-missing benchmark datasets show that our method achieves significant performance improvements compared with state-of-the-art methods. We have released the code and simulation datasets at: https://github.com/Alexadlu/Modality-missing-RGBT-Tracking.git.
40.8CVApr 29Code
Graph-based Semantic Calibration Network for Unaligned UAV RGBT Image Semantic Segmentation and A Large-scale BenchmarkFangqiang Fan, Zhicheng Zhao, Xiaoliang Ma et al.
Fine-grained RGBT image semantic segmentation is crucial for all-weather unmanned aerial vehicle (UAV) scene understanding. However, UAV RGBT semantic segmentation faces two coupled challenges: cross-modal spatial misalignment caused by sensor parallax and platform vibration, and severe semantic confusion among fine-grained ground objects under top-down aerial views. To address these issues, we propose a Graph-based Semantic Calibration Network (GSCNet) for unaligned UAV RGBT image semantic segmentation. Specifically, we design a Feature Decoupling and Alignment Module (FDAM) that decouples each modality into shared structural and private perceptual components and performs deformable alignment in the shared subspace, enabling robust spatial correction with reduced modality appearance interference. Moreover, we propose a Semantic Graph Calibration Module (SGCM) that explicitly encodes the hierarchical taxonomy and co-occurrence regularities among ground-object categories in UAV scenes into a structured category graph, and incorporates these priors into graph-attention reasoning to calibrate predictions of visually similar and rare categories. In addition, we construct the Unaligned RGB-Thermal Fine-grained (URTF) benchmark, to the best of our knowledge, the largest and most fine-grained benchmark for unaligned UAV RGBT image semantic segmentation, containing over 25,000 image pairs across 61 categories with realistic cross-modal misalignment. Extensive experiments on URTF demonstrate that GSCNet significantly outperforms state-of-the-art methods, with notable gains on fine-grained categories. The dataset is available at https://github.com/mmic-lcl/Datasets-and-benchmark-code.
CVDec 4, 2023Code
SequencePAR: Understanding Pedestrian Attributes via A Sequence Generation ParadigmJiandong Jin, Xiao Wang, Yin Lin et al.
Current pedestrian attribute recognition (PAR) algorithms use multi-label or multi-task learning frameworks with specific classification heads. These models often struggle with imbalanced data and noisy samples. Inspired by the success of generative models, we propose Sequence Pedestrian Attribute Recognition (SequencePAR), a novel sequence generation paradigm for PAR. SequencePAR extracts pedestrian features using a language-image pre-trained model and embeds the attribute set into query tokens guided by text prompts. A Transformer decoder generates human attributes by integrating visual features and attribute query tokens. The masked multi-head attention layer in the decoder prevents the model from seeing the next attribute while making predictions during training. The extensive experiments on multiple PAR datasets validate the effectiveness of SequencePAR. Specifically, we achieve 84.92%, 90.44%, 90.73%, and 90.46% in accuracy, precision, recall, and F1-score on the PETA dataset. The source code and pre-trained models are available at https://github.com/Event-AHU/OpenPAR.
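Masked attention of the kind mentioned in the abstract above is commonly realized with a causal (lower-triangular) mask over sequence positions. The sketch below shows that standard construction only; whether SequencePAR uses exactly this form is an assumption, and the function name `causal_mask` is hypothetical.

```python
def causal_mask(num_tokens):
    """Build a boolean mask where entry [i][j] is True if position i may attend to j.

    Position i may attend only to positions j <= i, so later attribute tokens
    stay hidden while earlier ones are being predicted.
    """
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]


mask = causal_mask(3)
for row in mask:
    print(row)
```

In an attention layer, scores at positions where the mask is False would typically be set to a large negative value before the softmax, so they receive effectively zero weight.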
CVJan 5, 2024Code
CRSOT: Cross-Resolution Object Tracking using Unaligned Frame and Event CamerasYabin Zhu, Xiao Wang, Chenglong Li et al.
Existing datasets for RGB-DVS tracking are collected with the DVS346 camera and their resolution ($346 \times 260$) is low for practical applications. Actually, only visible cameras are deployed in many practical systems, and the newly designed neuromorphic cameras may have different resolutions. The latest neuromorphic sensors can output high-definition event streams, but it is very difficult to achieve strict alignment between events and frames on both spatial and temporal views. Therefore, how to achieve accurate tracking with unaligned neuromorphic and visible sensors is a valuable but unresearched problem. In this work, we formally propose the task of object tracking using unaligned neuromorphic and visible cameras. We build the first unaligned frame-event dataset CRSOT collected with a specially built data acquisition system, which contains 1,030 high-definition RGB-Event video pairs and 304,974 video frames. In addition, we propose a novel unaligned object tracking framework that can realize robust tracking even using the loosely aligned RGB-Event data. Specifically, we extract the template and search regions of RGB and Event data and feed them into a unified ViT backbone for feature embedding. Then, we propose uncertainty perception modules to encode the RGB and Event features, respectively, and a modality uncertainty fusion module to aggregate the two modalities. These three branches are jointly optimized in the training phase. Extensive experiments demonstrate that our tracker can combine the dual modalities for high-performance tracking even without strict temporal and spatial alignment. The source code, dataset, and pre-trained models will be released at https://github.com/Event-AHU/Cross_Resolution_SOT.
67.2LGApr 13
Unified Graph Prompt Learning via Low-Rank Graph Message PromptingBeibei Wang, Bo Jiang, Ziyan Zhang et al.
Graph Data Prompt (GDP), which introduces specific prompts in graph data for efficiently adapting pre-trained GNNs, has become a mainstream approach to the graph fine-tuning problem. However, existing GDPs have each been designed for a distinct graph component (e.g., node features, edge features, edge weights) and thus operate within limited prompt spaces for graph data. To the best of our knowledge, a unified prompter suitable for targeting all graph components simultaneously is still lacking. To address this challenge, in this paper, we first propose to reinterpret a wide range of existing GDPs from the perspective of the Graph Message Prompt (GMP) paradigm. Based on GMP, we then introduce a novel graph prompt learning approach, termed Low-Rank GMP (LR-GMP), which leverages low-rank prompt representation to achieve effective and compact graph prompt learning. Unlike traditional GDPs that target distinct graph components separately, LR-GMP concurrently performs prompting on all graph components in a unified manner, thereby achieving significantly superior generalization and robustness on diverse downstream tasks. Extensive experiments on several graph benchmark datasets demonstrate the effectiveness and advantages of our proposed LR-GMP.
CVDec 27, 2025
Towards Robust Optical-SAR Object Detection under Missing Modalities: A Dynamic Quality-Aware Fusion FrameworkZhicheng Zhao, Yuancheng Xu, Andong Lu et al.
Optical and Synthetic Aperture Radar (SAR) fusion-based object detection has attracted significant research interest in remote sensing, as these modalities provide complementary information for all-weather monitoring. However, practical deployment is severely limited by inherent challenges. Due to distinct imaging mechanisms, temporal asynchrony, and registration difficulties, obtaining well-aligned optical-SAR image pairs remains extremely difficult, frequently resulting in missing or degraded modality data. Although recent approaches have attempted to address this issue, they still suffer from limited robustness to random missing modalities and lack effective mechanisms to ensure consistent performance improvement in fusion-based detection. To address these limitations, we propose a novel Quality-Aware Dynamic Fusion Network (QDFNet) for robust optical-SAR object detection. Our proposed method leverages learnable reference tokens to dynamically assess feature reliability and guide adaptive fusion in the presence of missing modalities. In particular, we design a Dynamic Modality Quality Assessment (DMQA) module that employs learnable reference tokens to iteratively refine feature reliability assessment, enabling precise identification of degraded regions and providing quality guidance for subsequent fusion. Moreover, we develop an Orthogonal Constraint Normalization Fusion (OCNF) module that employs orthogonal constraints to preserve modality independence while dynamically adjusting fusion weights based on reliability scores, effectively suppressing unreliable feature propagation. Extensive experiments on the SpaceNet6-OTD and OGSOD-2.0 datasets demonstrate the superiority and effectiveness of QDFNet compared to state-of-the-art methods, particularly under partial modality corruption or missing data scenarios.
CVMay 4, 2024Code
AFter: Attention-based Fusion Router for RGBT TrackingAndong Lu, Wanyu Wang, Chenglong Li et al.
Multi-modal feature fusion, a core component of RGBT tracking, has been the subject of numerous studies in recent years. However, existing RGBT tracking methods widely adopt fixed fusion structures to integrate multi-modal features, which struggle to handle the various challenges of dynamic scenarios. To address this problem, this work presents a novel Attention-based Fusion router called AFter, which optimizes the fusion structure to adapt to dynamic, challenging scenarios for robust RGBT tracking. In particular, we design a fusion structure space based on the hierarchical attention network, in which each attention-based fusion unit corresponds to a fusion operation and a combination of these attention units corresponds to a fusion structure. By optimizing the combination of attention-based fusion units, we can dynamically select the fusion structure to adapt to various challenging scenarios. Unlike the complex search over different structures in neural architecture search algorithms, we develop a dynamic routing algorithm, which equips each attention-based fusion unit with a router, to predict the combination weights for efficient optimization of the fusion structure. Extensive experiments on five mainstream RGBT tracking datasets demonstrate the superior performance of the proposed AFter against state-of-the-art RGBT trackers. We release the code at https://github.com/Alexadlu/AFter.
CVDec 25, 2023Code
Nighttime Person Re-Identification via Collaborative Enhancement Network with Multi-domain LearningAndong Lu, Chenglong Li, Tianrui Zha et al.
Prevalent nighttime person re-identification (ReID) methods typically combine image relighting and ReID networks in a sequential manner. However, their performance (recognition accuracy) is limited by the quality of the relit images and by insufficient collaboration between the image relighting and ReID tasks. To handle these problems, we propose a novel Collaborative Enhancement Network called CENet, which performs multilevel feature interactions in a parallel framework for nighttime person ReID. In particular, the parallel structure of CENet not only avoids the impact of relit image quality on ReID performance, but also allows us to mine the collaborative relations between the image relighting and person ReID tasks. To this end, we integrate multilevel feature interactions in CENet: we first share the Transformer encoder to build the low-level feature interaction, and then perform feature distillation that transfers the high-level features from image relighting to ReID, thereby alleviating the severe image degradation caused by the nighttime scenario while avoiding the impact of the relit images. In addition, existing real-world nighttime person ReID datasets are limited in size, and large-scale synthetic ones exhibit substantial domain gaps with real-world data. To leverage both small-scale real-world and large-scale synthetic training data, we develop a multi-domain learning algorithm, which alternately uses both kinds of data to reduce the inter-domain difference during training. Extensive experiments on two real nighttime datasets, Night600 and RGBNT201$_{rgb}$, and a synthetic nighttime ReID dataset validate the effectiveness of CENet. We release the code and synthetic dataset at https://github.com/Alexadlu/CENet.
CVDec 15, 2023Code
Structural Information Guided Multimodal Pre-training for Vehicle-centric PerceptionXiao Wang, Wentao Wu, Chenglong Li et al.
Understanding vehicles in images is important for various applications such as intelligent transportation and self-driving systems. Existing vehicle-centric works typically pre-train models on large-scale classification datasets and then fine-tune them for specific downstream tasks. However, they neglect the specific characteristics of vehicle perception in different tasks and might thus lead to sub-optimal performance. To address this issue, we propose a novel vehicle-centric pre-training framework called VehicleMAE, which incorporates structural information, including the spatial structure from vehicle profile information and the semantic structure from informative high-level natural language descriptions, for effective masked vehicle appearance reconstruction. To be specific, we explicitly extract the sketch lines of vehicles as a form of spatial structure to guide vehicle reconstruction. More comprehensive knowledge, distilled from the large CLIP model based on the similarity between paired and unpaired vehicle image-text samples, is further taken into consideration to help achieve a better understanding of vehicles. A large-scale dataset, termed Autobot1M, is built to pre-train our model; it contains about 1M vehicle images and 12,693 text descriptions. Extensive experiments on four vehicle-based downstream tasks fully validate the effectiveness of our VehicleMAE. The source code and pre-trained models will be released at https://github.com/Event-AHU/VehicleMAE.
CVFeb 25
NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-TrainingDengdi Sun, Xiaoya Zhou, Xiao Wang et al.
Neural operators have emerged as an efficient paradigm for solving PDEs, overcoming the limitations of traditional numerical methods and significantly improving computational efficiency. However, due to the diversity and complexity of PDE systems, existing neural operators typically rely on a single network architecture, which limits their capacity to fully capture heterogeneous features and complex system dependencies. This constraint poses a bottleneck for large-scale PDE pre-training based on neural operators. To address these challenges, we propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. In particular, the image-level MoE is designed to capture global dependencies, while the token-level Sub-MoE focuses on local dependencies. Our model can selectively activate the most suitable expert networks for a given input, thereby enhancing generalization and transferability. We conduct large-scale pre-training on twelve PDE datasets from diverse sources and successfully transfer the model to downstream tasks. Extensive experiments demonstrate the effectiveness of our approach.
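The abstract says the model selectively activates the most suitable experts for a given input but does not spell out the routing rule. As an illustration only, the sketch below shows generic top-k softmax gating, a common way to implement sparse expert selection. It is not NESTOR's nested image-level/token-level design, and the names `top_k_gate`, `moe_forward`, and the toy experts are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return dict(zip(top, weights))

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    gate = top_k_gate(router_logits, k)
    return sum(w * experts[i](x) for i, w in gate.items())

# Toy scalar "experts" (assumed for illustration; real experts are networks).
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]   # assumed router scores for one input

gate = top_k_gate(logits, k=2)
y = moe_forward(3.0, experts, logits, k=2)
```

Only the selected experts are evaluated, which is what keeps the compute cost roughly constant as the number of experts grows.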
IVJan 7, 2025Code
Activating Associative Disease-Aware Vision Token Memory for LLM-Based X-ray Report GenerationXiao Wang, Fuling Wang, Haowen Wang et al.
X-ray image based medical report generation has achieved significant progress in recent years with the help of large language models. However, these models have not fully exploited the effective information in visual image regions, resulting in reports that are linguistically sound but insufficient in describing key diseases. In this paper, we propose a novel associative memory-enhanced X-ray report generation model that effectively mimics the process of professional doctors writing medical reports. It considers both the mining of global and local visual information and associates historical report information to better complete the writing of the current report. Specifically, given an X-ray image, we first utilize a classification model along with its activation maps to mine visual regions highly associated with diseases and to learn disease query tokens. Then, we employ a visual Hopfield network to establish memory associations for disease-related tokens, and a report Hopfield network to retrieve report memory information. This process facilitates the generation of high-quality reports based on a large language model and achieves state-of-the-art performance on multiple benchmark datasets, including IU X-ray, MIMIC-CXR, and CheXpert Plus. The source code of this work is released at https://github.com/Event-AHU/Medical_Image_Analysis.
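The abstract does not state which Hopfield formulation is used. As a rough illustration of associative memory retrieval, the sketch below implements the softmax-based update of a modern (continuous) Hopfield network: a query is compared with every stored pattern and the result is an attention-weighted mix of those patterns. The function name `hopfield_retrieve`, the toy memory, and the value of `beta` are assumptions, not details from the paper.

```python
import math

def hopfield_retrieve(query, memory, beta=1.0):
    """One retrieval step: softmax(beta * memory @ query) applied to memory.

    query:  list of floats, length d
    memory: list of stored patterns, each a list of d floats
    Returns (retrieved_pattern, attention_weights).
    """
    scores = [beta * sum(q * m for q, m in zip(query, row)) for row in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(query)
    retrieved = [sum(a * row[j] for a, row in zip(attn, memory))
                 for j in range(dim)]
    return retrieved, attn

# Three stored patterns (toy, assumed) and a noisy query close to the first.
memory = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]

retrieved, attn = hopfield_retrieve(query, memory, beta=8.0)
```

A larger `beta` sharpens the attention toward the single closest stored pattern, while a smaller value blends several memories.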
IVApr 27, 2024Code
Pre-training on High Definition X-ray Images: An Experimental StudyXiao Wang, Yuehang Li, Wentao Wu et al.
Existing X-ray based pre-trained vision models are usually trained on relatively small-scale datasets (fewer than 500k samples) with limited resolution (e.g., 224 $\times$ 224). However, the key to the success of self-supervised pre-training of large models lies in massive training data, and maintaining high resolution in X-ray images is essential for effectively addressing difficult and miscellaneous diseases. In this paper, we address these issues by proposing the first high-definition (1280 $\times$ 1280) X-ray based pre-trained foundation vision model, trained on our newly collected large-scale dataset of more than 1 million X-ray images. Our model follows the masked auto-encoder framework: the tokens that remain after masking (with a high masking rate) are used as input, and the masked image patches are reconstructed by the Transformer encoder-decoder network. More importantly, we introduce a novel context-aware masking strategy that uses the chest contour as a boundary for adaptive masking operations. We validate the effectiveness of our model on two downstream tasks, X-ray report generation and disease recognition. Extensive experiments demonstrate that our pre-trained medical foundation vision model achieves comparable or even new state-of-the-art performance on downstream benchmark datasets. The source code and pre-trained models of this paper will be released at https://github.com/Event-AHU/Medical_Image_Analysis.
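To make the masked auto-encoder input concrete, the sketch below counts the patch tokens of a 1280 $\times$ 1280 image and splits them into visible and masked sets. The patch size of 16 and the masking ratio of 0.75 are assumptions chosen for illustration (the abstract only says the rate is high), and the uniform random masking shown here is the plain baseline, not the paper's context-aware strategy based on the chest contour.

```python
import random

def random_mask(num_patches, mask_ratio, seed=0):
    """Split patch indices into (visible, masked) by uniform random sampling."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)

image_size = 1280          # from the abstract
patch_size = 16            # assumed; not stated in the abstract
mask_ratio = 0.75          # assumed "high" masking rate

grid = image_size // patch_size        # patches per side
num_patches = grid * grid              # total patch tokens

visible, masked = random_mask(num_patches, mask_ratio)
# Only the visible tokens are fed to the encoder; the decoder
# reconstructs the pixels of the masked patches.
```

Under these assumed settings the encoder sees 1,600 of 6,400 tokens, which is what makes pre-training at this resolution tractable.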
CVMay 17
DeTrack: A Benchmark and Altitude-Aware Dual World Model for Drone-embodied TrackingGuyue Hu, Haoming Liu, Siyuan Song et al.
Aerial object tracking has broad applications in public safety, emergency rescue, wildlife monitoring, and related fields. However, existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed camera locations or predefined flight paths, where drones are treated as passive cameras rather than embodied agents that actively perceive, interact, and control their motion in dynamic 3D scenes. In this paper, we define a new drone-embodied tracking task, termed DeTrack, which requires a drone to track a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. We build a large-scale benchmark containing 11,368 target trajectories across diverse scenes, rendering conditions, semantic regions, and moving distractors, together with evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude-aware dual world model framework for drone-embodied tracking. AaDWorlds consists of an altitude-aware perception module and dual world models that imagine future states under both high- and low-altitude regimes. By combining pseudo altitude-aware observations and imagined future states, AaDWorlds alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety. Experiments on the DeTrack benchmark demonstrate that AaDWorlds improves closed-loop tracking performance across all evaluation metrics.
CVMar 18, 2025Code
LEGNet: A Lightweight Edge-Gaussian Network for Low-Quality Remote Sensing Image Object DetectionWei Lu, Si-Bao Chen, Hui-Dong Li et al.
Remote sensing object detection (RSOD) often suffers from degradations such as low spatial resolution, sensor noise, motion blur, and adverse illumination. These factors diminish feature distinctiveness, leading to ambiguous object representations and inadequate foreground-background separation. Existing RSOD methods exhibit limitations in robust detection of low-quality objects. To address these pressing challenges, we introduce LEGNet, a lightweight backbone network featuring a novel Edge-Gaussian Aggregation (EGA) module specifically engineered to enhance feature representation derived from low-quality remote sensing images. The EGA module integrates: (a) orientation-aware Scharr filters to sharpen crucial edge details often lost in low-contrast or blurred objects, and (b) Gaussian-prior-based feature refinement to suppress noise and regularize ambiguous feature responses, enhancing foreground saliency under challenging conditions. The EGA module alleviates the reduced contrast, structural discontinuities, and ambiguous feature responses that are prevalent in degraded images, effectively improving model robustness while maintaining computational efficiency. Comprehensive evaluations across five benchmarks (DOTA-v1.0, v1.5, DIOR-R, FAIR1M-v1.0, and VisDrone2019) demonstrate that LEGNet achieves state-of-the-art performance, particularly in detecting low-quality objects. The code is available at https://github.com/AeroVILab-AHU/LEGNet.
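For readers unfamiliar with Scharr filters, the sketch below applies the standard 3 $\times$ 3 Scharr kernels to a tiny image and computes the gradient magnitude. It shows only the classical fixed-kernel operator on raw pixels; how LEGNet makes the filters orientation-aware and combines them with the Gaussian prior inside the EGA module is not described in the abstract and is not reproduced here. The helper names are hypothetical.

```python
import math

# Standard Scharr kernels for horizontal and vertical intensity gradients.
SCHARR_X = [[-3, 0, 3],
            [-10, 0, 10],
            [-3, 0, 3]]
SCHARR_Y = [[-3, -10, -3],
            [0, 0, 0],
            [3, 10, 3]]

def correlate3x3(image, kernel):
    """Valid-mode 3x3 cross-correlation on a 2D list (no padding)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * (w - 2) for _ in range(h - 2)]
    for r in range(h - 2):
        for c in range(w - 2):
            out[r][c] = sum(kernel[i][j] * image[r + i][c + j]
                            for i in range(3) for j in range(3))
    return out

def scharr_magnitude(image):
    """Per-pixel gradient magnitude sqrt(gx^2 + gy^2)."""
    gx = correlate3x3(image, SCHARR_X)
    gy = correlate3x3(image, SCHARR_Y)
    return [[math.hypot(a, b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]

# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edges = scharr_magnitude(image)
```

Because the toy image varies only horizontally, the vertical gradient is zero and the whole response comes from `SCHARR_X`.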
CVDec 18, 2023Code
Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video RecognitionXiao Wang, Yao Rong, Shiao Wang et al.
Pattern recognition based on RGB-Event data is a newly arising research topic, and previous works usually learn their features using a CNN or a Transformer. CNNs capture local features well, while cascaded self-attention mechanisms are good at extracting long-range global relations. It is intuitive to combine them for high-performance RGB-Event based video recognition. However, existing works fail to achieve a good balance between accuracy and the number of model parameters. In this work, we propose a novel RGB-Event based recognition framework termed TSCFormer, which is a relatively lightweight CNN-Transformer model. Specifically, we mainly adopt the CNN as the backbone network to first encode both RGB and Event data. Meanwhile, we initialize global tokens as the input and fuse them with RGB and Event features using the BridgeFormer module. It captures the global long-range relations between both modalities well while keeping the overall model architecture simple. The enhanced features are projected and fused into the RGB and Event CNN blocks, respectively, in an interactive manner using the F2E and F2V modules. Similar operations are conducted for other CNN blocks to achieve adaptive fusion and local-global feature enhancement under different resolutions. Finally, we concatenate these three features and feed them into the classification head for pattern recognition. Extensive experiments on two large-scale RGB-Event benchmark datasets (PokerEvent and HARDVS) fully validate the effectiveness of our proposed TSCFormer. The source code and pre-trained models will be released at https://github.com/Event-AHU/TSCFormer.
CVMar 23, 2025Code
Real-World Remote Sensing Image Dehazing: Benchmark and BaselineZeng-Hui Zhu, Wei Lu, Si-Bao Chen et al.
Remote Sensing Image Dehazing (RSID) poses significant challenges in real-world scenarios due to the complex atmospheric conditions and severe color distortions that degrade image quality. The scarcity of real-world remote sensing hazy image pairs has compelled existing methods to rely primarily on synthetic datasets. However, these methods struggle with real-world applications due to the inherent domain gap between synthetic and real data. To address this, we introduce Real-World Remote Sensing Hazy Image Dataset (RRSHID), the first large-scale dataset featuring real-world hazy and dehazed image pairs across diverse atmospheric conditions. Based on this, we propose MCAF-Net, a novel framework tailored for real-world RSID. Its effectiveness arises from three innovative components: Multi-branch Feature Integration Block Aggregator (MFIBA), which enables robust feature extraction through cascaded integration blocks and parallel multi-branch processing; Color-Calibrated Self-Supervised Attention Module (CSAM), which mitigates complex color distortions via self-supervised learning and attention-guided refinement; and Multi-Scale Feature Adaptive Fusion Module (MFAFM), which integrates features effectively while preserving local details and global context. Extensive experiments validate that MCAF-Net demonstrates state-of-the-art performance in real-world RSID, while maintaining competitive performance on synthetic datasets. The introduction of RRSHID and MCAF-Net sets new benchmarks for real-world RSID research, advancing practical solutions for this complex task. The code and dataset are publicly available at https://github.com/lwCVer/RRSHID.