CV · Aug 21, 2023

Joint learning of images and videos with a single Vision Transformer

arXiv:2308.10533v1 · h-index: 6
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for unified models in computer vision, though it appears incremental because it adapts existing methods to a new task.

The authors address the separate training of image and video models by proposing joint learning with a single Vision Transformer, and they report experimental results on two image datasets and two action recognition datasets.

In this study, we propose a method for joint learning of images and videos using a single model. Images and videos are usually trained with separate models. We propose a method that takes a batch of images as input to a Vision Transformer, IV-ViT, and also takes a set of video frames, whose outputs are aggregated over time by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.
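The two input paths described in the abstract can be illustrated with a minimal sketch. It assumes that late fusion means averaging per-frame class scores, which the abstract does not specify. `toy_model` is a hypothetical stand-in for the shared Vision Transformer, and the function names are illustrative and are not taken from the paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Model = Callable[[Vector], Vector]


def late_fusion(frame_scores: Sequence[Vector]) -> Vector:
    """Average per-frame class scores into one clip-level score vector.

    Assumption: plain averaging; the paper may use a different aggregation.
    """
    num_frames = len(frame_scores)
    num_classes = len(frame_scores[0])
    return [
        sum(scores[c] for scores in frame_scores) / num_frames
        for c in range(num_classes)
    ]


def forward_images(model: Model, images: Sequence[Vector]) -> List[Vector]:
    """Image path: each image in the batch gets its own prediction."""
    return [model(image) for image in images]


def forward_video(model: Model, frames: Sequence[Vector]) -> Vector:
    """Video path: the same model scores every frame, then scores are fused."""
    return late_fusion([model(frame) for frame in frames])


def toy_model(x: Vector) -> Vector:
    """Hypothetical two-class scorer standing in for the shared ViT."""
    total = sum(x)
    return [total, -total]


if __name__ == "__main__":
    images = [[1.0, 2.0], [0.5, 0.5]]   # a batch of two "images"
    clip = [[1.0], [3.0]]               # one "video" with two frames
    print(forward_images(toy_model, images))
    print(forward_video(toy_model, clip))
```

The point of the sketch is that both paths call the same model, so a single set of weights serves images and video frames; only the video path adds the temporal aggregation step.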

