CV | May 13, 2025 | Code
Towards Adaptive Meta-Gradient Adversarial Examples for Visual Tracking
Wei-Long Tian, Peng Gao, Xiao Liu et al.
In recent years, visual tracking methods based on convolutional neural networks and Transformers have achieved remarkable performance and have been successfully applied in fields such as autonomous driving. However, the numerous security issues exposed by deep learning models have gradually affected the reliable application of visual tracking methods in real-world scenarios. Therefore, how to reveal the security vulnerabilities of existing visual trackers through effective adversarial attacks has become a critical problem that needs to be addressed. To this end, we propose an adaptive meta-gradient adversarial attack (AMGA) method for visual tracking. This method integrates multi-model ensembles and meta-learning strategies and combines momentum mechanisms with Gaussian smoothing, which can significantly enhance the transferability and attack effectiveness of adversarial examples. AMGA randomly selects models from a large model repository, constructs diverse tracking scenarios, and iteratively performs both white- and black-box adversarial attacks in each scenario, optimizing the gradient directions of each model. This paradigm minimizes the gap between white- and black-box adversarial attacks, thus achieving excellent attack performance in black-box scenarios. Extensive experimental results on large-scale datasets such as OTB2015, LaSOT, and GOT-10k demonstrate that AMGA significantly improves the attack performance, transferability, and deceptiveness of adversarial examples. Code and data are available at https://github.com/pgao-lab/AMGA.
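The abstract names momentum and Gaussian smoothing as ingredients but does not spell out the update rule. The following is a minimal sketch of a generic momentum-based sign-gradient step with Gaussian-smoothed gradients on a 1-D signal; it is not the AMGA algorithm. The function names, the toy `grad_fn` interface, and the parameters `eps`, `steps`, and `mu` are all assumptions made for illustration.

```python
import math

def gaussian_kernel(radius, sigma):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(values, kernel):
    """Convolve a 1-D list with the kernel, replicating the edge values."""
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += w * values[idx]
        out.append(acc)
    return out

def sign(v):
    return (v > 0) - (v < 0)

def momentum_attack(x, grad_fn, eps=0.1, steps=5, mu=1.0, kernel=None):
    """Hypothetical helper: iterative sign-gradient ascent with momentum.

    grad_fn(adv) must return the loss gradient w.r.t. the input (assumed interface).
    """
    kernel = kernel or gaussian_kernel(1, 1.0)
    alpha = eps / steps                      # per-step size
    adv = list(x)
    velocity = [0.0] * len(x)
    for _ in range(steps):
        grad = smooth(grad_fn(adv), kernel)  # Gaussian-smoothed gradient
        norm = sum(abs(g) for g in grad) or 1.0
        # Accumulate the L1-normalized gradient into the momentum term.
        velocity = [mu * v + g / norm for v, g in zip(velocity, grad)]
        adv = [a + alpha * sign(v) for a, v in zip(adv, velocity)]
        # Project back into the eps-ball around the clean input.
        adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(adv, x)]
    return adv
```

In a real attack, `grad_fn` would come from backpropagation through one or more tracker models; here it is left as a plain callable so the update rule itself stays visible.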
CV | Feb 27, 2024
In Defense and Revival of Bayesian Filtering for Thermal Infrared Object Tracking
Peng Gao, Shi-Min Li, Feng Gao et al.
Deep learning-based methods monopolize the latest research in the field of thermal infrared (TIR) object tracking. However, relying solely on deep learning models to obtain better tracking results requires carefully selecting feature information that is beneficial to representing the target object and designing a reasonable template update strategy, which undoubtedly increases the difficulty of model design. Thus, recent TIR tracking methods face many challenges in complex scenarios. This paper introduces a novel Deep Bayesian Filtering (DBF) method to enhance TIR tracking in these challenging situations. DBF is distinctive in its dual-model structure: the system and observation models. The system model leverages motion data to estimate the potential positions of the target object based on two-dimensional Brownian motion, thus generating a prior probability. Following this, the observation model comes into play upon capturing the TIR image. It serves as a classifier and employs infrared information to ascertain the likelihood of these estimated positions, creating a likelihood probability. According to the guidance of the two models, the position of the target object can be determined, and the template can be dynamically updated. Experimental analysis across several benchmark datasets reveals that DBF achieves competitive performance, surpassing most existing TIR tracking methods in complex scenarios.
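The prior-times-likelihood step described above can be illustrated with a small sketch. It assumes a finite set of candidate positions, models one step of two-dimensional Brownian motion as an isotropic Gaussian around the previous position, and takes the observation likelihoods as given numbers (in DBF they would come from the classifier). All function names and the `sigma` parameter are hypothetical and not taken from the paper.

```python
import math

def brownian_prior(prev, candidates, sigma):
    """Prior over candidates: isotropic Gaussian centred on the previous position."""
    px, py = prev
    weights = [math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
               for x, y in candidates]
    total = sum(weights)
    return [w / total for w in weights]

def posterior(prior, likelihood):
    """Bayes' rule on a discrete set: normalize prior * likelihood."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def estimate_position(prev, candidates, likelihood, sigma=2.0):
    """Return the maximum a posteriori candidate and the full posterior."""
    prior = brownian_prior(prev, candidates, sigma)
    post = posterior(prior, likelihood)
    best = max(range(len(candidates)), key=post.__getitem__)
    return candidates[best], post

# A distant candidate with a high classifier score loses to a nearby one,
# because the motion prior makes the large jump unlikely.
position, post = estimate_position(
    prev=(0, 0),
    candidates=[(0, 0), (1, 0), (5, 5)],
    likelihood=[0.1, 0.8, 0.9],
)
```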
LG | Apr 27, 2024
Critical Review for One-class Classification: recent advances and the reality behind them
Toshitaka Hayashi, Dalibor Cimr, Hamido Fujita et al.
This paper offers a comprehensive review of one-class classification (OCC), examining the technologies and methodologies employed in its implementation. It delves into various approaches utilized for OCC across diverse data types, such as feature data, image, video, time series, and others. Through a systematic review, this paper synthesizes prominent strategies used in OCC from its inception to its current advancements, with a particular emphasis on the promising application. Moreover, the article criticizes the state-of-the-art (SOTA) image anomaly detection (AD) algorithms dominating one-class experiments. These algorithms include outlier exposure (binary classification) and pretrained model (multi-class classification), conflicting with the fundamental concept of learning from one class. Our investigation reveals that the top nine algorithms for one-class CIFAR10 benchmark are not OCC. We argue that binary/multi-class classification algorithms should not be compared with OCC.
LG | Dec 12, 2024
Multi-view Clustering via Unified Multi-kernel Learning and Matrix Factorization
Chenxing Jia, Mingjie Cai, Hamido Fujita
Multi-view clustering has become increasingly important due to the multi-source character of real-world data. Among existing multi-view clustering methods, multi-kernel clustering and matrix factorization-based multi-view clustering have gained widespread attention as mainstream approaches. However, multi-kernel clustering tends to learn an optimal kernel and then perform eigenvalue decomposition on it, which leads to high computational complexity. Matrix factorization-based multi-view clustering methods impose orthogonal constraints on individual views. This overly emphasizes the accuracy of clustering structures within single views and restricts the learning of individual views. Based on this analysis, we propose a multi-view clustering method that integrates multi-kernel learning with matrix factorization. This approach combines the advantages of both multi-kernel learning and matrix factorization. It removes the orthogonal constraints on individual views and imposes orthogonal constraints on the consensus matrix, resulting in an accurate final clustering structure. Ultimately, the method is unified into a simple form of multi-kernel clustering, but avoids learning an optimal kernel, thus reducing the time complexity. Furthermore, we propose an efficient three-step optimization algorithm to achieve a locally optimal solution. Experiments on widely-used real-world datasets demonstrate the effectiveness of our proposed method.
CV | Jan 31, 2025
Project-and-Fuse: Improving RGB-D Semantic Segmentation via Graph Convolution Networks
Xiaoyan Jiang, Bohan Wang, Xinlong Wan et al.
Most existing RGB-D semantic segmentation methods focus on feature-level fusion, including complex cross-modality and cross-scale fusion modules. However, these methods may cause misalignment problems in the feature fusion process and counter-intuitive patches in the segmentation results. Inspired by the popular pixel-node-pixel pipeline, we propose to 1) fuse features from two modalities in a late fusion style, during which the geometric feature injection is guided by texture feature prior; 2) employ Graph Neural Networks (GNNs) on the fused feature to alleviate the emergence of irregular patches by inferring patch relationship. At the 3D feature extraction stage, we argue that traditional CNNs are not efficient enough for depth maps. So, we encode the depth map into a normal map, after which CNNs can easily extract object surface tendencies. At the projection matrix generation stage, we find the existence of Biased-Assignment and Ambiguous-Locality issues in the original pipeline. Therefore, we propose to 1) adopt the Kullback-Leibler Loss to ensure that no important pixel features are missed, which can be viewed as a hard pixel mining process; 2) connect regions that are close to each other in the Euclidean space as well as in the semantic space with larger edge weights so that location information can be considered. Extensive experiments on two public datasets, NYU-DepthV2 and SUN RGB-D, have shown that our approach can consistently boost the performance of the RGB-D semantic segmentation task.
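The abstract does not say how the depth map is encoded into a normal map. One common approach, sketched below under simplifying assumptions, takes finite differences of the depth and normalizes the vector (-dz/dx, -dz/dy, 1). The sketch assumes unit pixel spacing and ignores camera intrinsics; the function name is hypothetical and this is not claimed to be the paper's procedure.

```python
import math

def depth_to_normals(depth):
    """Per-pixel unit normals from a 2-D depth grid (list of rows).

    Uses central differences in the interior and one-sided differences
    at the borders, with unit pixel spacing (an assumption).
    """
    h, w = len(depth), len(depth[0])
    normals = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            dzdy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            n = (-dzdx, -dzdy, 1.0)
            length = math.sqrt(sum(c * c for c in n))
            normals[y][x] = tuple(c / length for c in n)
    return normals

# A flat surface faces the camera; a ramp in x tilts the normal along x.
flat = depth_to_normals([[2.0] * 3 for _ in range(3)])
ramp = depth_to_normals([[float(x) for x in range(4)] for _ in range(3)])
```

The resulting three-channel map has the same spatial size as the depth map, so it can be fed to an image CNN in place of the raw depth.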
CL | Jul 2, 2021
ClueReader: Heterogeneous Graph Attention Network for Multi-hop Machine Reading Comprehension
Peng Gao, Feng Gao, Peng Wang et al.
Multi-hop machine reading comprehension is a challenging task in natural language processing as it requires more reasoning ability across multiple documents. Spectral models based on graph convolutional networks have shown good inferring abilities and led to competitive results. However, the analysis and reasoning of some of these models are inconsistent with those of humans. Inspired by the concept of grandmother cells in cognitive neuroscience, we propose a heterogeneous graph attention network model named ClueReader to imitate the grandmother cell concept. The model is designed to assemble the semantic features in multi-level representations and automatically concentrate or alleviate information for reasoning through the attention mechanism. The name ClueReader is a metaphor for the pattern of the model: it regards the subjects of queries as the starting points of clues, takes the reasoning entities as bridge points, considers the latent candidate entities as grandmother cells, and the clues end up in candidate entities. The proposed model enables the visualization of the reasoning graph, making it possible to analyze the importance of edges connecting entities and the selectivity in the mention and candidate nodes, which is easier to comprehend empirically. Evaluations on the open-domain multi-hop reading dataset WikiHop and the drug-drug interaction dataset MedHop proved the validity of ClueReader and showed the feasibility of applying the model in the molecular biology domain.
LG | Jan 16, 2020
Cluster-based Zero-shot learning for multivariate data
Toshitaka Hayashi, Hamido Fujita
Supervised learning requires a sufficient training dataset that includes all labels. However, there are cases in which some classes are not in the training data. Zero-Shot Learning (ZSL) is the task of predicting a class that is not in the training data (the target class). Existing ZSL methods are designed for image data. However, the zero-shot problem can occur with any data type. Hence, ZSL needs to be considered for other data types. In this paper, we propose a cluster-based ZSL method, which is a baseline method for multivariate binary classification problems. The proposed method is based on the assumption that data far from the training data belongs to the target class. In training, the training data is clustered. In prediction, each sample is checked for membership in a cluster. If a sample does not belong to any cluster, it is predicted as the target class. The proposed method is evaluated and demonstrated using the KEEL dataset. This paper has been published in the Journal of Ambient Intelligence and Humanized Computing. The final version is available at the following URL: https://link.springer.com/article/10.1007/s12652-020-02268-5
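The train-then-check-membership procedure described above can be sketched as follows. The abstract does not specify the clustering algorithm or the membership rule, so this sketch assumes plain k-means with a deterministic initialization and treats a sample as belonging to a cluster when it lies within that cluster's largest training distance from the centroid. The names `fit`, `predict`, and `margin` are hypothetical.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(data, k, iters=20):
    """Plain k-means; the first k points seed the centroids (an assumption)."""
    centroids = [tuple(p) for p in data[:k]]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in data:
            idx = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[idx].append(p)
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids, groups

def fit(data, k):
    """Training: cluster the seen-class data and record each cluster's radius."""
    centroids, groups = kmeans(data, k)
    radii = [max((dist(p, c) for p in g), default=0.0)
             for c, g in zip(centroids, groups)]
    return centroids, radii

def predict(x, centroids, radii, margin=1.0):
    """Prediction: 'seen' if x falls inside any cluster, otherwise 'target'."""
    for c, r in zip(centroids, radii):
        if dist(x, c) <= r * margin:
            return "seen"
    return "target"

train = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, radii = fit(train, k=2)
labels = [predict(p, centroids, radii) for p in [(0.5, 0.5), (5, 5), (10.5, 10.5)]]
```

A larger `margin` makes the method more reluctant to assign samples to the target class, which is one simple way to trade false alarms against missed target samples.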
CV | Aug 27, 2019
Learning Reinforced Attentional Representation for End-to-End Visual Tracking
Peng Gao, Qiquan Zhang, Fei Wang et al.
Although numerous recent tracking approaches have made tremendous advances in the last decade, achieving high-performance visual tracking remains a challenge. In this paper, we propose an end-to-end network model to learn reinforced attentional representation for accurate target object discrimination and localization. We utilize a novel hierarchical attentional module with long short-term memory and multi-layer perceptrons to leverage both inter- and intra-frame attention to effectively facilitate visual pattern emphasis. Moreover, we incorporate a contextual attentional correlation filter into the backbone network to make our model trainable in an end-to-end fashion. Our proposed approach not only takes full advantage of informative geometries and semantics but also updates correlation filters online without fine-tuning the backbone network to enable the adaptation of variations in the target object's appearance. Extensive experiments conducted on several popular benchmark datasets demonstrate that our proposed approach is effective and computationally efficient.
CV | Apr 23, 2019
Siamese Attentional Keypoint Network for High Performance Visual Tracking
Peng Gao, Ruyue Yuan, Fei Wang et al.
In this paper, we investigate the impacts of three main aspects of visual tracking, i.e., the backbone network, the attentional mechanism, and the detection component, and propose a Siamese Attentional Keypoint Network, dubbed SATIN, for efficient tracking and accurate localization. Firstly, a new Siamese lightweight hourglass network is specially designed for visual tracking. It takes advantage of the benefits of the repeated bottom-up and top-down inference to capture more global and local contextual information at multiple scales. Secondly, a novel cross-attentional module is utilized to leverage both channel-wise and spatial intermediate attentional information, which can enhance both the discriminative and localization capabilities of feature maps. Thirdly, a keypoint detection approach is devised to trace any target object by detecting the top-left corner point, the centroid point, and the bottom-right corner point of its bounding box. Therefore, our SATIN tracker not only has a strong capability to learn more effective object representations, but is also computationally and memory efficient during both the training and testing stages. To the best of our knowledge, we are the first to propose this approach. Without bells and whistles, experimental results demonstrate that our approach achieves state-of-the-art performance on several recent benchmark datasets, at a speed far exceeding 27 frames per second.