CV · Jan 29, 2019

Visual Rhythm Prediction with Feature-Aligning Network

arXiv:1901.10163v1 · 5 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of predicting visual rhythms in videos for applications such as video editing. The advance is incremental because it builds on existing sequence labeling techniques.

The paper tackles visual rhythm prediction by replacing human-crafted rules with a data-driven method using a feature-aligning network, achieving an F1 score of 79.6 on a custom dataset of music videos.

In this paper, we propose a data-driven visual rhythm prediction method that overcomes a deficiency of previous work, in which predictions are made primarily by human-crafted hard rules. In our approach, we first extract features including original frames and their residuals, optical flow, scene change, and body pose. These visual features are then fed into an end-to-end neural network as inputs. Here we observe slight misalignment between features over the timeline and assume that it arises from differences in how the features are computed. To solve this problem, the extracted features are aligned by an elaborately designed layer, which can also be applied to other models suffering from mismatched features to boost their performance. The aligned features are then fed into sequence labeling layers implemented with BiLSTM and CRF to predict the onsets. Because no public training and evaluation set exists, we experiment on a dataset that we constructed from professionally edited music videos (MVs), on which the F1 score of our approach reaches 79.6.
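To make the reported metric concrete, here is a minimal sketch of how an F1 score for onset prediction could be computed. The abstract does not state the matching protocol, so the function name `onset_f1`, the frame-index representation, and the `tolerance` window are all assumptions made for illustration and are not taken from the paper. The function returns values in [0, 1], whereas the paper reports F1 on a 0 to 100 scale.

```python
def onset_f1(predicted, reference, tolerance=1):
    """Score predicted onset frames against reference onset frames.

    A prediction counts as a hit when it lies within `tolerance` frames of a
    reference onset that has not been matched yet (greedy one-to-one matching
    over sorted lists). The tolerance window is an assumption for this sketch,
    not the evaluation protocol stated in the paper.
    """
    predicted = sorted(predicted)
    reference = sorted(reference)
    hits = 0
    i = j = 0
    while i < len(predicted) and j < len(reference):
        diff = predicted[i] - reference[j]
        if abs(diff) <= tolerance:
            hits += 1      # matched pair: consume both onsets
            i += 1
            j += 1
        elif diff < 0:
            i += 1         # prediction is too early to match anything left
        else:
            j += 1         # this reference onset was missed
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical onset frame indices, purely for illustration.
precision, recall, f1 = onset_f1([10, 25, 40, 58], [10, 24, 41, 70, 90])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

With these made-up inputs, three of the four predictions match a reference onset, so precision is 0.75, recall is 0.60, and F1 is about 0.667.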

