CV · Aug 24, 2017

Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction

arXiv:1708.07335v10.9 · Has Code

Originality Synthesis-oriented

AI Analysis

This work addresses video-level representation for real-fake expression prediction; it is an incremental improvement on a domain-specific task.

The paper tackles the problem of aggregating frame-level visual features into a video-level representation for real-fake expression prediction. It introduces a learnable aggregation technique that retains short-time temporal structure and spatial interdependencies, and it achieves a 65% MAP score on the test dataset, close to the best reported result of 66.7%.

Frame-level visual features are generally aggregated in time with techniques such as LSTM, Fisher Vectors, and NetVLAD to produce a robust video-level representation. Here we introduce a learnable aggregation technique whose primary objective is to retain the short-time temporal structure between frame-level features and their spatial interdependencies in the representation. It can also be easily adapted to cases where training samples are very scarce. We evaluate the method on a real-fake expression prediction dataset to demonstrate its superiority. Our method obtains a score of 65% on the test dataset in the official MAP evaluation, differing by only one misclassified decision from the best reported result in the ChaLearn Challenge (i.e., 66.7%). Lastly, we believe that this method can be extended to other problems, such as action/event recognition, in the future.
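
The remark that 65% and 66.7% differ by a single decision can be checked with a little arithmetic. The sketch below is illustrative only: it assumes that the reported score behaves like the fraction of correct decisions, that both figures are rounded to one decimal place, and that "65%" means 65.0%. None of these assumptions is stated in the abstract, and the function name `consistent_test_sizes` is hypothetical. Under those assumptions, it searches for test-set sizes at which one extra correct decision moves the score from 65.0% to 66.7%.

```python
def consistent_test_sizes(lower_pct=65.0, upper_pct=66.7, max_n=200):
    """Find (n, k) such that k/n rounds to lower_pct and (k+1)/n rounds to upper_pct.

    Assumption (not from the abstract): the score is the share of correct
    decisions among n test samples, reported to one decimal place.
    """
    matches = []
    for n in range(1, max_n + 1):
        for k in range(n):
            low = round(100 * k / n, 1)         # score with k correct decisions
            high = round(100 * (k + 1) / n, 1)  # score with one more correct
            if low == lower_pct and high == upper_pct:
                matches.append((n, k))
    return matches


matches = consistent_test_sizes()
print(matches)  # [(60, 39)]
```

Under these assumptions the only consistent size up to 200 is 60 test samples, with 39 versus 40 correct decisions. This is an inference from the two reported numbers, not a figure taken from the abstract.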
