CV · Jan 5, 2021

Trear: Transformer-based RGB-D Egocentric Action Recognition

arXiv:2101.03904v1 · 104 citations
Originality: Highly original
AI Analysis

This work addresses egocentric action recognition from RGB-D video and reports improvements over existing methods on three datasets.

This paper introduces Trear, a Transformer-based framework for egocentric action recognition using RGB-D data. It models temporal structure with self-attention and fuses multi-modal features, achieving state-of-the-art performance on THU-READ, FPHA, and WCVS datasets.

In this paper, we propose a Transformer-based RGB-D egocentric action recognition framework, called Trear. It consists of two modules: an inter-frame attention encoder and a mutual-attentional fusion block. Instead of using optical flow or recurrent units, we adopt a self-attention mechanism to model the temporal structure of the data from different modalities. Input frames are cropped randomly to mitigate the effect of data redundancy. Features from each modality interact through the proposed fusion block and are combined through a simple yet effective fusion operation to produce a joint RGB-D representation. Experiments on two large egocentric RGB-D datasets, THU-READ and FPHA, and one small dataset, WCVS, show that the proposed method outperforms the state-of-the-art results by a large margin.
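The two modules named in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: it uses plain scaled dot-product attention with no learned projections, a single head, and toy per-frame feature vectors. The function names (`attend`, `inter_frame_encoder`, `mutual_attention_fusion`) and the fusion step (element-wise sum followed by a temporal average) are assumptions made for illustration, since the abstract does not specify the fusion operation. Random frame cropping is omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention (no learned projections, an assumption)."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def inter_frame_encoder(frames):
    """Self-attention across the frames of one modality (hypothetical name)."""
    return attend(frames, frames, frames)

def mutual_attention_fusion(rgb, depth):
    """Each modality queries the other; the results are then combined.

    The combination used here (element-wise sum, then mean over time)
    is an assumed stand-in for the paper's fusion operation.
    """
    rgb_att = attend(rgb, depth, depth)    # RGB frames attend to depth frames
    depth_att = attend(depth, rgb, rgb)    # depth frames attend to RGB frames
    fused = [[a + b for a, b in zip(r, d)] for r, d in zip(rgb_att, depth_att)]
    n_frames = len(fused)
    return [sum(f[j] for f in fused) / n_frames for j in range(len(fused[0]))]

# Toy input: 3 frames per modality, 4-dimensional features per frame.
rgb = [[1.0, 0.0, 0.5, 0.2], [0.9, 0.1, 0.4, 0.3], [0.0, 1.0, 0.2, 0.8]]
depth = [[0.2, 0.7, 0.1, 0.0], [0.3, 0.6, 0.2, 0.1], [0.8, 0.1, 0.9, 0.4]]

joint = mutual_attention_fusion(inter_frame_encoder(rgb),
                                inter_frame_encoder(depth))
```

Here `joint` is a single feature vector for the whole clip, which a classifier could map to an action label. A real model would add learned query, key, and value projections, multiple heads, and positional information.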

