CV · Jun 6, 2021

Transferring Knowledge from Text to Video: Zero-Shot Anticipation for Procedural Actions

arXiv:2106.03158v2 · 19 citations
Originality: Incremental advance
AI Analysis

This paper addresses zero-shot learning for procedural actions in robotics; it appears incremental because it builds on existing text-to-video transfer methods.

The paper tackles the problem of enabling robots to recognize and predict activities they have never seen by transferring knowledge from text to video. It uses a hierarchical model that generalizes instructional knowledge from text corpora to anticipate actions multiple steps into the future, expressed in natural language. The approach is demonstrated on the Tasty Videos Dataset V2, a collection of 4022 recipes.

Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model, we introduce the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. Extensive experiments with various evaluation metrics demonstrate the potential of our method for generalization, given limited video data for training models.

