OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos
This work addresses the need for larger datasets and stronger models for open-vocabulary repetition counting in videos. The contribution is incremental in that it builds on existing video analysis tasks.
The authors introduce OVR, a large-scale dataset for temporal repetition counting with annotations for over 72K videos, and propose OVRCounter, a transformer-based baseline model that localizes and counts repetitions, showing performance improvements over a prior repetition counting model.
We introduce a dataset of annotations of temporal repetitions in videos. The dataset, OVR (pronounced "over"), contains annotations for over 72K videos, with each annotation specifying the number of repetitions, the start and end time of the repetitions, and a free-form description of what is repeating. The annotations are provided for videos sourced from Kinetics and Ego4D, and consequently cover both exocentric (Exo) and egocentric (Ego) viewing conditions, with a huge variety of actions and activities. Moreover, OVR is almost an order of magnitude larger than previous datasets for video repetition. We also propose a baseline transformer-based counting model, OVRCounter, that can localise and count repetitions in videos that are up to 320 frames long. The model is trained and evaluated on the OVR dataset, and its performance is assessed with and without using text to specify the target class to count. The performance is also compared to a prior repetition counting model. The dataset is available for download at: https://sites.google.com/view/openvocabreps/
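
To make the annotation contents concrete, the following is a minimal sketch of how one annotation record might be represented in Python. It is an illustration only: the class name, field names, units (seconds), and the helper methods are assumptions made here and are not the released file format. The only details taken from the abstract are the four annotated quantities (repetition count, start time, end time, free-form description) and the 320-frame limit of OVRCounter. The frame rate passed to the window check is likewise a hypothetical parameter.

```python
from dataclasses import dataclass
from typing import Optional

MAX_FRAMES = 320  # longest clip length the abstract states for OVRCounter


@dataclass(frozen=True)
class RepetitionAnnotation:
    """One annotation record; field names are hypothetical, not the released schema."""
    video_id: str
    count: int          # number of repetitions
    start_s: float      # start of the repeating segment, in seconds (assumed unit)
    end_s: float        # end of the repeating segment, in seconds (assumed unit)
    description: str    # free-form text describing what is repeating

    def __post_init__(self) -> None:
        # Basic sanity checks on the annotated values.
        if self.count < 0:
            raise ValueError("count must be non-negative")
        if self.end_s < self.start_s:
            raise ValueError("end_s must not precede start_s")

    @property
    def duration_s(self) -> float:
        """Length of the repeating segment in seconds."""
        return self.end_s - self.start_s

    def mean_period_s(self) -> Optional[float]:
        """Average seconds per repetition, or None when nothing repeats."""
        return self.duration_s / self.count if self.count else None

    def fits_model_window(self, fps: float, max_frames: int = MAX_FRAMES) -> bool:
        """True if the segment, sampled at `fps`, needs at most `max_frames` frames."""
        return self.duration_s * fps <= max_frames


# Made-up example values, not taken from the dataset.
ann = RepetitionAnnotation("example_clip", 8, 2.0, 10.0, "person doing push-ups")
print(ann.duration_s, ann.mean_period_s(), ann.fits_model_window(fps=30.0))
```

Under these assumptions, a segment that is too long for the 320-frame window at a given frame rate would have to be subsampled or split before being passed to a model with that limit. How the authors actually handle such cases is not stated in the abstract.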