Generic Event Boundary Detection via Denoising Diffusion
This work addresses the problem of subjective event segmentation in videos for computer vision applications. It offers a novel generative approach that is an incremental step beyond prior deterministic methods.
The paper tackles generic event boundary detection in videos by proposing a diffusion-based generative model, DiffGEBD, which produces diverse plausible boundaries instead of a single deterministic prediction, achieving strong performance on the Kinetics-GEBD and TAPOS benchmarks.
Generic event boundary detection (GEBD) aims to identify natural boundaries in a video, segmenting it into distinct and meaningful chunks. Despite the inherent subjectivity of event boundaries, previous methods have focused on deterministic predictions, overlooking the diversity of plausible solutions. In this paper, we introduce a novel diffusion-based boundary detection model, dubbed DiffGEBD, that tackles the problem of GEBD from a generative perspective. The proposed model encodes relevant changes across adjacent frames via temporal self-similarity and then iteratively decodes random noise into plausible event boundaries, conditioned on the encoded features. Classifier-free guidance allows the degree of diversity to be controlled during denoising diffusion. In addition, we introduce a new evaluation metric that assesses the quality of predictions in terms of both diversity and fidelity. Experiments show that our method achieves strong performance on two standard benchmarks, Kinetics-GEBD and TAPOS, generating diverse and plausible event boundaries.
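To make two of the ingredients named above more concrete, the following is a minimal sketch, not the DiffGEBD implementation. It assumes that temporal self-similarity can be illustrated as a matrix of pairwise cosine similarities between per-frame feature vectors, and that classifier-free guidance takes the commonly used form (1 + w) * conditional - w * unconditional. The function names, the toy features, and the guidance weight `w` are all hypothetical, and the paper's exact formulation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def temporal_self_similarity(frames):
    """Return a T x T matrix of pairwise similarities between frame features.

    Assumption: cosine similarity stands in for whatever similarity
    function the actual model uses.
    """
    return [[cosine(f, g) for g in frames] for f in frames]


def guided_prediction(cond, uncond, w):
    """Combine conditional and unconditional denoiser outputs.

    Assumption: the common classifier-free guidance form, where w = 0
    returns the conditional output and larger w pushes the result
    further from the unconditional output.
    """
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]


# Toy features (hypothetical): the content changes between frames 1 and 2.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
tsm = temporal_self_similarity(frames)

# Similarity of each frame to its successor; a drop hints at a boundary.
adjacent = [tsm[t][t + 1] for t in range(len(frames) - 1)]

# Toy denoiser outputs (hypothetical values) combined with weight w = 1.
guided = guided_prediction([0.8, 0.2], [0.5, 0.5], w=1.0)
```

In this toy case the adjacent similarity falls only at the transition between frames 1 and 2, which is the kind of change a boundary detector would need to pick up. Under the assumed guidance form, raising `w` weights the conditioning signal more heavily, which is generally expected to trade sample diversity for fidelity.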