Step Differences in Instructional Video
This work addresses a limitation of current language-based assistance, which can only handle a single video, and offers an incremental improvement toward AR/VR technology for personalized assistance.
The paper tackles the problem of comparing user videos to reference how-to videos for AR/VR assistance. It proposes a method that generates visual instruction tuning data from HowTo100M and trains a video-conditioned language model to reason across multiple videos. The model achieves state-of-the-art performance at identifying differences between video pairs and at ranking videos by the severity of those differences.
Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress. However, current approaches for language-based assistance can only answer questions about a single video. We propose an approach that first automatically generates large amounts of visual instruction tuning data involving pairs of videos from HowTo100M by leveraging existing step annotations and accompanying narrations, and then trains a video-conditioned language model to jointly reason across multiple raw videos. Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos based on the severity of these differences, and shows promising ability to perform general reasoning over multiple videos. Project page: https://github.com/facebookresearch/stepdiff
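As a rough illustration of the data-generation idea described above, the following minimal sketch pairs clips from different videos that share a step annotation and turns their narrations into a text prompt asking for differences. It is not the authors' pipeline: the `Clip` record, the `build_pair_prompts` function, the prompt wording, and the sample clips are all hypothetical, and the actual method may select pairs and produce instruction data quite differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Clip:
    """Hypothetical record: one annotated step segment from a how-to video."""
    video_id: str
    step: str        # step annotation (assumed to be a shared label)
    narration: str   # narration text accompanying the segment


def build_pair_prompts(clips):
    """Pair clips from different videos that show the same step.

    Returns one dict per pair, holding the two video ids, the step, and a
    text prompt that could be given to a language model to describe the
    differences. The prompt wording is an assumption for illustration.
    """
    by_step = defaultdict(list)
    for clip in clips:
        by_step[clip.step].append(clip)

    prompts = []
    for step, group in by_step.items():
        for a, b in combinations(group, 2):
            if a.video_id == b.video_id:
                continue  # only compare segments from different videos
            prompts.append({
                "videos": (a.video_id, b.video_id),
                "step": step,
                "prompt": (
                    f"Both videos show the step '{step}'.\n"
                    f"Video 1 narration: {a.narration}\n"
                    f"Video 2 narration: {b.narration}\n"
                    "List the differences in how the step is performed."
                ),
            })
    return prompts


# Made-up sample clips, not taken from HowTo100M.
clips = [
    Clip("vidA", "whisk eggs", "I whisk the eggs with a fork."),
    Clip("vidB", "whisk eggs", "Use an electric mixer on the eggs."),
    Clip("vidC", "chop onion", "Dice the onion finely."),
]
pairs = build_pair_prompts(clips)
```

In this sketch only the two "whisk eggs" clips form a pair, since no other video shares the "chop onion" step. Grouping by step label first keeps the pairing restricted to comparable segments, which is the property the abstract attributes to using existing step annotations.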