MM-SEAL: A Large-scale Video Dataset of Multi-person Multi-grained Spatio-temporally Action Localization
This paper provides a new benchmark for multi-person complex activity localization in daily life and addresses challenges such as semantic complexity and long duration. The contribution is incremental, however, because it builds on existing localization tasks.
The authors introduce MM-SEAL, a large-scale video dataset with 111.7k atomic actions and 17.7k complex activities, for multi-person spatio-temporal action localization. They show that atomic action features improve complex activity localization and that pretraining on MM-SEAL boosts performance on other benchmarks.
In this paper, we introduce a novel large-scale video dataset, dubbed MM-SEAL, for multi-person multi-grained spatio-temporal action localization in daily human life. We are the first to propose a benchmark for multi-person spatio-temporal complex activity localization, where complex semantics and long durations bring new challenges to localization tasks. We observe that a limited set of atomic actions can be combined into many complex activities. MM-SEAL provides both atomic action and complex activity annotations, producing 111.7k atomic actions spanning 172 action categories and 17.7k complex activities spanning 200 activity categories. We explore the relationship between atomic actions and complex activities, finding that atomic action features can improve complex activity localization performance. We also propose a new network, termed Faster-TAD, that generates temporal proposals and labels simultaneously. Finally, our evaluations show that visual features pretrained on MM-SEAL can improve performance on other action localization benchmarks. We will release the dataset and the project code upon publication of the paper.
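To make the multi-grained idea concrete, the following is a minimal sketch of how atomic actions might be associated with the complex activity they help compose. It is not the MM-SEAL annotation format, which the text above does not specify: the `Segment` schema, the `atomic_actions_within` helper, the `min_overlap` threshold, and the example labels are all hypothetical, and spatial bounding boxes are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A labelled time span for one person (hypothetical schema, not MM-SEAL's)."""
    person_id: int
    label: str
    start: float  # seconds
    end: float    # seconds


def atomic_actions_within(activity, atomic_actions, min_overlap=0.5):
    """Return the same person's atomic actions that lie mostly inside the activity.

    An atomic action is kept when at least `min_overlap` of its own duration
    overlaps the activity's time span. The threshold is an assumed value.
    """
    selected = []
    for action in atomic_actions:
        if action.person_id != activity.person_id:
            continue  # only group actions performed by the same person
        overlap = min(action.end, activity.end) - max(action.start, activity.start)
        duration = action.end - action.start
        if duration > 0 and overlap / duration >= min_overlap:
            selected.append(action)
    return selected


# Illustrative labels only; they are not taken from the dataset's category list.
activity = Segment(person_id=0, label="cooking", start=10.0, end=60.0)
atomic = [
    Segment(0, "stand", 8.0, 30.0),         # 20 s of 22 s inside the activity
    Segment(0, "walk", 55.0, 70.0),         # only 5 s of 15 s inside
    Segment(1, "sit", 12.0, 40.0),          # a different person
    Segment(0, "hold object", 20.0, 45.0),  # fully inside the activity
]
labels = [a.label for a in atomic_actions_within(activity, atomic)]
print(labels)
```

Grouping by person and by temporal overlap is only one plausible way to relate the two annotation granularities. A real pipeline would also need the per-frame boxes that a spatio-temporal localization task implies.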