CV · AI · LG · Nov 19, 2021

Grounded Situation Recognition with Transformers

arXiv:2111.10135v1 · 27 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses structured understanding of complex visual scenes. It is an incremental advance: it applies a Transformer encoder-decoder architecture to the specific task of Grounded Situation Recognition.

The paper tackles the task of Grounded Situation Recognition (GSR), which involves classifying actions and predicting entities with their locations in images, by proposing a Transformer-based model that achieves state-of-the-art results on the SWiG benchmark.

Grounded Situation Recognition (GSR) is a task that not only classifies a salient action (verb), but also predicts entities (nouns) associated with semantic roles and their locations in the given image. Inspired by the remarkable success of Transformers in vision tasks, we propose a GSR model based on a Transformer encoder-decoder architecture. The attention mechanism of our model enables accurate verb classification by effectively capturing high-level semantic features of an image, and allows the model to flexibly deal with the complicated and image-dependent relations between entities for improved noun classification and localization. Our model is the first Transformer architecture for GSR, and achieves the state of the art in every evaluation metric on the SWiG benchmark. Our code is available at https://github.com/jhcho99/gsrtr.
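To make the task definition concrete, here is a minimal sketch of what a GSR output could look like as a data structure: one verb, plus a noun and an optional bounding box for each semantic role. It is not taken from the authors' repository. The class names, the (x1, y1, x2, y2) box format, the example role names, and the 0.5 IoU threshold are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class RolePrediction:
    """Noun predicted for one semantic role, with an optional location."""
    noun: str
    box: Optional[Box]  # None when the entity has no box in the image


@dataclass
class Situation:
    """Hypothetical container for a GSR output: a verb plus its filled roles."""
    verb: str
    roles: Dict[str, RolePrediction]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def role_matches(pred: RolePrediction, gold: RolePrediction,
                 iou_threshold: float = 0.5) -> bool:
    """True if the noun agrees and the boxes agree.

    The threshold is an assumption of this sketch, not a value stated above.
    """
    if pred.noun != gold.noun:
        return False
    if pred.box is None or gold.box is None:
        # Both sides must agree that the entity has no box.
        return pred.box is None and gold.box is None
    return iou(pred.box, gold.box) >= iou_threshold


# Illustrative prediction and reference for a single image.
pred = Situation("feeding", {
    "agent": RolePrediction("woman", (10.0, 10.0, 110.0, 210.0)),
    "food": RolePrediction("bread", None),
})
gold = Situation("feeding", {
    "agent": RolePrediction("woman", (12.0, 8.0, 108.0, 205.0)),
    "food": RolePrediction("bread", None),
})
all_roles_ok = pred.verb == gold.verb and all(
    role_matches(pred.roles[r], gold.roles[r]) for r in gold.roles
)
```

Keeping the verb, the nouns, and the boxes in one structure mirrors the three things the task asks for: action classification, entity prediction per role, and localization.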

Code Implementations: 1 repo
