CV · Aug 28, 2024

DEAR: Depth-Enhanced Action Recognition

arXiv:2408.15679v2 · 1 citation · h-index: 2 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses action recognition in complex scenes for computer vision applications. The advance is incremental because it builds on existing frameworks such as Side4Video and VideoMamba.

The paper tackles action recognition in cluttered videos by integrating 3D depth maps with RGB features, achieving improved accuracy on the Something-Something V2 dataset compared to a baseline implementation.

Detecting actions in videos, particularly in cluttered scenes, is challenging because a camera captures only 2D frames, whereas human vision benefits from 3D understanding. This research introduces a novel approach that integrates 3D features and depth maps alongside RGB features to improve action recognition accuracy. Our method processes estimated depth maps through a branch separate from the RGB feature encoder and fuses the two sets of features to understand the scene and actions comprehensively. Using the Side4Video framework and VideoMamba, which employ CLIP and VisionMamba for spatial feature extraction, our approach outperformed our implementation of the Side4Video network on the Something-Something V2 dataset. Our code is available at: https://github.com/SadeghRahmaniB/DEAR
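
The two-branch idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the features are fused, so concatenation is assumed here, and the tiny pooling-plus-projection encoders are hypothetical stand-ins for the CLIP and VisionMamba backbones. All function and parameter names (`encode_clip`, `fuse_and_classify`, `params`) are invented for illustration.

```python
import random

FEAT_DIM = 8        # toy feature width (assumption)
NUM_CLASSES = 174   # Something-Something V2 has 174 action classes
T, PIXELS = 4, 16   # toy clip: 4 frames of 16 pixels each (assumption)

def linear(x, weight):
    """Multiply feature vector x by a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def random_weight(rows, cols, rng):
    """Small random matrix standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def pool_frame(frame):
    """Global average pool: frame is [channels][pixels], one value per channel."""
    return [sum(channel) / len(channel) for channel in frame]

def encode_clip(frames, weight):
    """Stand-in encoder: pool each frame, project it, then average over time."""
    per_frame = [linear(pool_frame(f), weight) for f in frames]
    return [sum(vals) / len(vals) for vals in zip(*per_frame)]

def fuse_and_classify(rgb_frames, depth_frames, params):
    rgb_feat = encode_clip(rgb_frames, params["rgb"])        # RGB branch
    depth_feat = encode_clip(depth_frames, params["depth"])  # separate depth branch
    fused = rgb_feat + depth_feat   # list concatenation as the (assumed) fusion step
    return linear(fused, params["head"])                     # class logits

rng = random.Random(0)
# Toy inputs: 3-channel RGB frames and 1-channel estimated depth maps.
rgb_clip = [[[rng.random() for _ in range(PIXELS)] for _ in range(3)] for _ in range(T)]
depth_clip = [[[rng.random() for _ in range(PIXELS)]] for _ in range(T)]
params = {
    "rgb": random_weight(FEAT_DIM, 3, rng),
    "depth": random_weight(FEAT_DIM, 1, rng),
    "head": random_weight(NUM_CLASSES, 2 * FEAT_DIM, rng),
}
logits = fuse_and_classify(rgb_clip, depth_clip, params)
predicted = max(range(NUM_CLASSES), key=logits.__getitem__)
```

The point of the sketch is only the data flow: depth never passes through the RGB encoder, and the classifier head sees both feature sets at once. The real fusion scheme is in the linked repository.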

Code Implementations: 1 repo
