CV · Sep 22, 2024

Zero-Shot Skeleton-based Action Recognition with Dual Visual-Text Alignment

Jidong Kuang, Hongsong Wang, Chaolei Han, Yang Zhang, Jie Gui

arXiv:2409.14336v2 · 11.311 citations · h-index: 4 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the scalability and generalization problem in action recognition for computer vision applications, though it appears incremental because it builds on existing alignment methods.

The paper tackles zero-shot skeleton-based action recognition by introducing Dual Visual-Text Alignment (DVTA) to better align visual features with semantic text vectors, achieving state-of-the-art results on multiple benchmarks.

Zero-shot action recognition, which addresses the issue of scalability and generalization in action recognition and allows models to adapt to new and unseen actions dynamically, is an important research topic in the computer vision community. The key to zero-shot action recognition lies in aligning visual features with semantic vectors representing action categories. Most existing methods either directly project visual features onto the semantic space of text categories or learn a shared embedding space between the two modalities. However, a direct projection cannot accurately align the two modalities, and learning a robust and discriminative embedding space between visual and text representations is often difficult. To address these issues, we introduce Dual Visual-Text Alignment (DVTA) for skeleton-based zero-shot action recognition. DVTA consists of two alignment modules, Direct Alignment (DA) and Augmented Alignment (AA), along with a specially designed Semantic Description Enhancement (SDE). The DA module maps the skeleton features to the semantic space through a specially designed visual projector, followed by the SDE, which uses cross-attention to strengthen the connection between skeleton and text, thereby reducing the gap between modalities. The AA module further strengthens the learning of the embedding space by using deep metric learning to learn the similarity between skeleton and text. Our approach achieves state-of-the-art performance on several popular zero-shot skeleton-based action recognition benchmarks. The code is available at: https://github.com/jidongkuang/DVTA.
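To make the alignment idea in the abstract concrete, here is a minimal NumPy sketch of zero-shot classification by projecting skeleton features into a text embedding space and picking the most similar class description. It is an illustration under stated assumptions, not the DVTA implementation: the linear projector, the function names (`project`, `cross_attention`, `zero_shot_predict`), and the choice of cosine similarity are all hypothetical, and the abstract does not specify how the SDE wires its cross-attention, so the attention function below is only the generic scaled dot-product form.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def project(skeleton_feats, W, b):
    """Hypothetical linear visual projector: skeleton space -> text space.
    (Assumption: the paper's projector is described only as 'specially designed'.)"""
    return skeleton_feats @ W + b

def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (assumed form, not the paper's SDE).
    Each query row is rewritten as a softmax-weighted mix of the value rows."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values

def zero_shot_predict(skeleton_feats, text_embeds, W, b):
    """Assign each skeleton sample to the unseen class whose text embedding
    has the highest cosine similarity with the projected skeleton feature."""
    z = l2_normalize(project(skeleton_feats, W, b))
    t = l2_normalize(text_embeds)
    sims = z @ t.T                      # shape: (num_samples, num_classes)
    return sims.argmax(axis=1), sims
```

At inference time, only the text embeddings of the unseen classes are needed: `zero_shot_predict` returns one class index per skeleton sample together with the full similarity matrix. A trained system would learn `W` and `b` (and any attention parameters) on seen classes; this sketch leaves training out entirely.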

