Oops! Predicting Unintentional Action in Video
This work addresses the challenge of understanding human intentionality in video. The contribution is incremental: it builds on existing action recognition methods, adding a new dataset and new tasks.
The researchers tackle the problem of predicting unintentional action in video by introducing a dataset and three tasks: recognition, localization, and anticipation. They find that a self-supervised approach based on video speed performs competitively with highly supervised pretraining, though a significant gap between machine and human performance remains.
From just a short glance at a video, we can often tell whether a person's action is intentional or not. Can we train a model to recognize this? We introduce a dataset of in-the-wild videos of unintentional action, as well as a suite of tasks for recognizing, localizing, and anticipating its onset. We train a supervised neural network as a baseline and analyze its performance compared to human consistency on the tasks. We also investigate self-supervised representations that leverage natural signals in our dataset, and show the effectiveness of an approach that uses the intrinsic speed of video to perform competitively with highly-supervised pretraining. However, a significant gap between machine and human performance remains. The project website is available at https://oops.cs.columbia.edu
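To make the speed-based self-supervision concrete, the following is a minimal sketch of how speed-labeled training clips could be generated, assuming the pretext task is to classify the playback speed of a clip. The stride values, function names, and clip length are illustrative assumptions, not details taken from the abstract.

```python
import random

# Hypothetical playback-speed classes: keep every k-th frame of the source video.
SPEED_STRIDES = (1, 2, 4, 8)


def sample_speed_clip(num_frames, clip_len, stride, rng):
    """Return frame indices for one clip subsampled at the given stride."""
    span = (clip_len - 1) * stride + 1  # number of source frames the clip covers
    if span > num_frames:
        raise ValueError("video too short for this clip length and stride")
    start = rng.randrange(num_frames - span + 1)  # random valid start frame
    return [start + i * stride for i in range(clip_len)]


def make_speed_example(num_frames, clip_len, rng):
    """Pick a random speed class; return (frame_indices, class_label)."""
    label = rng.randrange(len(SPEED_STRIDES))
    indices = sample_speed_clip(num_frames, clip_len, SPEED_STRIDES[label], rng)
    return indices, label


if __name__ == "__main__":
    rng = random.Random(0)
    indices, label = make_speed_example(num_frames=200, clip_len=16, rng=rng)
    print(label, indices)
```

The label comes for free from the sampling procedure, so no human annotation is needed; a video model trained to predict it would then be fine-tuned or probed on the recognition, localization, and anticipation tasks.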