Zero-Shot Activity Recognition with Verb Attribute Induction
This work addresses the problem of recognizing unseen activities without labeled data, which is relevant to researchers in computer vision and AI. The contribution is incremental: it builds on prior zero-shot work, but focuses on action attributes rather than object attributes.
The paper tackles zero-shot activity recognition by modeling the visual and linguistic attributes of action verbs; for example, "salute" is a light movement, a social act, and short in duration. It shows that action attributes inferred from language provide a predictive signal for unseen activities.
In this paper, we investigate large-scale zero-shot activity recognition by modeling the visual and linguistic attributes of action verbs. For example, the verb "salute" has several properties, such as being a light movement, a social act, and short in duration. We use these attributes as the internal mapping between visual and textual representations to reason about a previously unseen action. In contrast to much prior work that assumes access to gold standard attributes for zero-shot classes and focuses primarily on object attributes, our model uniquely learns to infer action attributes from dictionary definitions and distributed word representations. Experimental results confirm that action attributes inferred from language can provide a predictive signal for zero-shot prediction of previously unseen activities.
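The idea of using attributes as the bridge between visual and textual representations can be illustrated with a minimal sketch. This is not the paper's model: the attribute names, the attribute values, the cosine-similarity matching rule, and the function `predict_unseen_verb` are all assumptions made for illustration. The sketch supposes that some visual model has already produced attribute scores for an image, and that each unseen verb has an attribute vector inferred from language.

```python
import math

# Hypothetical attributes, in order: [light_movement, social_act, short_duration].
# These values are invented for illustration and are not taken from the paper.
VERB_ATTRIBUTES = {
    "salute": [1.0, 1.0, 1.0],
    "sprint": [0.0, 0.0, 1.0],
    "negotiate": [1.0, 1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_unseen_verb(image_attribute_scores, verb_attributes):
    """Return the verb whose language-derived attributes best match the image scores."""
    return max(
        verb_attributes,
        key=lambda verb: cosine(image_attribute_scores, verb_attributes[verb]),
    )


# Stand-in for the output of a visual attribute predictor on a single image.
image_scores = [0.9, 0.8, 0.7]
print(predict_unseen_verb(image_scores, VERB_ATTRIBUTES))
```

Because the classifier only compares attribute vectors, a verb with no labeled images can still be predicted, provided its attributes can be inferred from text such as a dictionary definition or a word embedding.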