CV · Apr 6, 2024

SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos

Tao Wu, Runyu He, Gangshan Wu, Limin Wang

arXiv:2404.04565v1 · CVPR

Originality: Synthesis-oriented

AI Analysis

The work addresses a gap in video understanding for applications such as sports analysis, though the contribution is incremental because it builds on existing visual relation detection tasks.

The authors address the lack of datasets for detecting complex human-human interactions in videos by introducing SportsHHI, a dataset with 34 high-level interaction classes from basketball and volleyball, containing 50,649 annotated interaction instances across 11,398 keyframes. They also propose a two-stage baseline method to benchmark it.

Video-based visual relation detection tasks, such as video scene graph generation, play important roles in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types of existing datasets have relatively low-level semantics and can often be recognized by appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address this issue, we propose a new video visual relation detection task: video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports. 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
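To make the scale of the task concrete, the following is a minimal sketch of how a keyframe annotation and the candidate pairs for a second-stage classifier might be represented. The schema (`PersonBox`, `Interaction`), the helper `candidate_pairs`, the placeholder label, and the assumption that interactions are ordered (subject, object) pairs are all hypothetical; they are not the dataset's actual format or the authors' baseline. Only the three counts are taken from the abstract.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import List, Tuple


@dataclass(frozen=True)
class PersonBox:
    """One human bounding box on a keyframe (hypothetical schema)."""
    person_id: int
    xyxy: Tuple[float, float, float, float]


@dataclass(frozen=True)
class Interaction:
    """One annotated interaction instance (hypothetical schema)."""
    subject_id: int
    object_id: int
    label: str  # placeholder string, not a real SportsHHI class name


def candidate_pairs(boxes: List[PersonBox]) -> List[Tuple[int, int]]:
    """Enumerate every ordered (subject, object) pair of distinct people.

    Assumption: a second stage would score each pair for an interaction.
    """
    return [(a.person_id, b.person_id) for a, b in permutations(boxes, 2)]


# Counts quoted in the abstract.
NUM_BOXES = 118_075
NUM_INTERACTIONS = 50_649
NUM_KEYFRAMES = 11_398

boxes_per_keyframe = NUM_BOXES / NUM_KEYFRAMES
interactions_per_keyframe = NUM_INTERACTIONS / NUM_KEYFRAMES

# Toy keyframe with 10 people and one made-up annotation.
toy_boxes = [PersonBox(i, (0.0, 0.0, 1.0, 1.0)) for i in range(10)]
toy_annotations = [Interaction(0, 1, "placeholder_class")]
pairs = candidate_pairs(toy_boxes)

print(f"{boxes_per_keyframe:.2f} boxes per keyframe")
print(f"{interactions_per_keyframe:.2f} interactions per keyframe")
print(f"{len(pairs)} ordered candidate pairs for 10 people")
```

Under these assumptions, a keyframe with about 10 people yields 90 ordered candidate pairs, while the dataset averages only about 4.4 annotated interactions per keyframe. Most pairs would therefore be negatives, which is one way to see why the task calls for context reasoning rather than pairwise appearance alone.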
