CVMay 3, 2023

Visual Transformation Telling

arXiv:2305.01928v25 citations
Originality Incremental advance
AI Analysis

It addresses the problem of enabling AI to reason about underlying causes of visual changes, which is incremental as it builds on existing video datasets and tasks.

The paper introduces Visual Transformation Telling (VTT), a new visual reasoning task that requires describing transformations between adjacent state images, and tests it with a dataset of 13,547 samples, finding that state-of-the-art models struggle with this challenge.

Humans can naturally reason from superficial state differences (e.g. ground wetness) to transformations descriptions (e.g. raining) according to their life experience. In this paper, we propose a new visual reasoning task to test this transformation reasoning ability in real-world scenarios, called \textbf{V}isual \textbf{T}ransformation \textbf{T}elling (VTT). Given a series of states (i.e. images), VTT requires to describe the transformation occurring between every two adjacent states. Different from existing visual reasoning tasks that focus on surface state reasoning, the advantage of VTT is that it captures the underlying causes, e.g. actions or events, behind the differences among states. We collect a novel dataset to support the study of transformation reasoning from two existing instructional video datasets, CrossTask and COIN, comprising 13,547 samples. Each sample involves the key state images along with their transformation descriptions. Our dataset covers diverse real-world activities, providing a rich resource for training and evaluation. To construct an initial benchmark for VTT, we test several models, including traditional visual storytelling methods (CST, GLACNet, Densecap) and advanced multimodal large language models (LLaVA v1.5-7B, Qwen-VL-chat, Gemini Pro Vision, GPT-4o, and GPT-4). Experimental results reveal that even state-of-the-art models still face challenges in VTT, highlighting substantial areas for improvement.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes