CVApr 8, 2023Code
Attack-Augmentation Mixing-Contrastive Skeletal Representation LearningBinqian Xu, Xiangbo Shu, Jiachao Zhang et al. · tencent-ai
Contrastive learning, relying on effective positive and negative sample pairs, is beneficial to learn informative skeleton representations in unsupervised skeleton-based action recognition. To achieve these positive and negative pairs, existing weak/strong data augmentation methods have to randomly change the appearance of skeletons for indirectly pursuing semantic perturbations. However, such approaches have two limitations: i) solely perturbing appearance cannot well capture the intrinsic semantic information of skeletons, and ii) randomly perturbation may change the original positive/negative pairs to soft positive/negative ones. To address the above dilemma, we start the first attempt to explore an attack-based augmentation scheme that additionally brings in direct semantic perturbation, for constructing hard positive pairs and further assisting in constructing hard negative pairs. In particular, we propose a novel Attack-Augmentation Mixing-Contrastive skeletal representation learning (A$^2$MC) to contrast hard positive features and hard negative features for learning more robust skeleton representations. In A$^2$MC, Attack-Augmentation (Att-Aug) is designed to collaboratively perform targeted and untargeted perturbations of skeletons via attack and augmentation respectively, for generating high-quality hard positive features. Meanwhile, Positive-Negative Mixer (PNM) is presented to mix hard positive features and negative features for generating hard negative features, which are adopted for updating the mixed memory banks. Extensive experiments on three public datasets demonstrate that A$^2$MC is competitive with the state-of-the-art methods. The code will be accessible on A$^2$MC (https://github.com/1xbq1/A2MC).
CVNov 27, 2024
EventCrab: Harnessing Frame and Point Synergy for Event-based Action Recognition and BeyondMeiqi Cao, Xiangbo Shu, Jiachao Zhang et al.
Event-based Action Recognition (EAR) possesses the advantages of high-temporal resolution capturing and privacy preservation compared with traditional action recognition. Current leading EAR solutions typically follow two regimes: project unconstructed event streams into dense constructed event frames and adopt powerful frame-specific networks, or employ lightweight point-specific networks to handle sparse unconstructed event points directly. However, such two regimes are blind to a fundamental issue: failing to accommodate the unique dense temporal and sparse spatial properties of asynchronous event data. In this article, we present a synergy-aware framework, i.e., EventCrab, that adeptly integrates the "lighter" frame-specific networks for dense event frames with the "heavier" point-specific networks for sparse event points, balancing accuracy and efficiency. Furthermore, we establish a joint frame-text-point representation space to bridge distinct event frames and points. In specific, to better exploit the unique spatiotemporal relationships inherent in asynchronous event points, we devise two strategies for the "heavier" point-specific embedding: i) a Spiking-like Context Learner (SCL) that extracts contextualized event points from raw event streams. ii) an Event Point Encoder (EPE) that further explores event-point long spatiotemporal features in a Hilbert-scan way. Experiments on four datasets demonstrate the significant performance of our proposed EventCrab, particularly gaining improvements of 5.17% on SeAct and 7.01% on HARDVS.
CVApr 14, 2025
Hierarchical Relation-augmented Representation Generalization for Few-shot Action RecognitionHongyu Qu, Ling Xing, Jiachao Zhang et al.
Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations for each video by designing inter-frame temporal modeling strategies or inter-video interaction at the coarse video-level granularity. However, they treat each episode task in isolation and neglect fine-grained temporal relation modeling between videos, thus failing to capture shared fine-grained temporal patterns across videos and reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. Going beyond conducting inter-frame temporal interactions, we further devise two components to respectively explore inter-video and inter-task relationships: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and enhancing both intra-class consistency and inter-class separability; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from the bank, which stores diverse temporal patterns from historical episode tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current top-leading FSAR methods.
CVJan 18, 2024
GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action RecognitionGuangzhao Dai, Xiangbo Shu, Wenhao Wu et al.
Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for notable performance in Zero-Shot Egocentric Action Recognition (ZS-EAR). Typically, VLMs handle ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of vision and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric videos. In this paper, we introduce GPT4Ego, a straightforward yet remarkably potent VLM framework for ZS-EAR, designed to enhance the fine-grained alignment of concept and description between vision and language. Extensive experiments demonstrate GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks, i.e., EPIC-KITCHENS-100 (33.2%, +9.4%), EGTEA (39.6%, +5.5%), and CharadesEgo (31.5%, +2.6%).
CVJul 6, 2018
From Rank Estimation to Rank Approximation: Rank Residual Constraint for Image RestorationZhiyuan Zha, Xin Yuan, Bihan Wen et al.
In this paper, we propose a novel approach to the rank minimization problem, termed rank residual constraint (RRC) model. Different from existing low-rank based approaches, such as the well-known nuclear norm minimization (NNM) and the weighted nuclear norm minimization (WNNM), which estimate the underlying low-rank matrix directly from the corrupted observations, we progressively approximate the underlying low-rank matrix via minimizing the rank residual. Through integrating the image nonlocal self-similarity (NSS) prior with the proposed RRC model, we apply it to image restoration tasks, including image denoising and image compression artifacts reduction. Towards this end, we first obtain a good reference of the original image groups by using the image NSS prior, and then the rank residual of the image groups between this reference and the degraded image is minimized to achieve a better estimate to the desired image. In this manner, both the reference and the estimated image are updated gradually and jointly in each iteration. Based on the group-based sparse representation model, we further provide a theoretical analysis on the feasibility of the proposed RRC model. Experimental results demonstrate that the proposed RRC model outperforms many state-of-the-art schemes in both the objective and perceptual quality.
CVSep 12, 2017
A Benchmark for Sparse Coding: When Group Sparsity Meets Rank MinimizationZhiyuan Zha, Xin Yuan, Bihan Wen et al.
Sparse coding has achieved a great success in various image processing tasks. However, a benchmark to measure the sparsity of image patch/group is missing since sparse coding is essentially an NP-hard problem. This work attempts to fill the gap from the perspective of rank minimization. More details please see the manuscript....
CVAug 16, 2016
A Comparative Study for the Nuclear Norms Minimization MethodsZhiyuan Zha, Bihan Wen, Jiachao Zhang et al.
The nuclear norm minimization (NNM) is commonly used to approximate the matrix rank by shrinking all singular values equally. However, the singular values have clear physical meanings in many practical problems, and NNM may not be able to faithfully approximate the matrix rank. To alleviate the above-mentioned limitation of NNM, recent studies have suggested that the weighted nuclear norm minimization (WNNM) can achieve a better rank estimation than NNM, which heuristically set the weight being inverse to the singular values. However, it still lacks a rigorous explanation why WNNM is more effective than NMM in various applications. In this paper, we analyze NNM and WNNM from the perspective of group sparse representation (GSR). Concretely, an adaptive dictionary learning method is devised to connect the rank minimization and GSR models. Based on the proposed dictionary, we prove that NNM and WNNM are equivalent to L1-norm minimization and the weighted L1-norm minimization in GSR, respectively. Inspired by enhancing sparsity of the weighted L1-norm minimization in comparison with L1-norm minimization in sparse representation, we thus explain that WNNM is more effective than NMM. By integrating the image nonlocal self-similarity (NSS) prior with the WNNM model, we then apply it to solve the image denoising problem. Experimental results demonstrate that WNNM is more effective than NNM and outperforms several state-of-the-art methods in both objective and perceptual quality.