CV · AI · Aug 31, 2020

Online Spatiotemporal Action Detection and Prediction via Causal Representations

arXiv:2008.13759v1
Originality: Incremental advance
AI Analysis

This work addresses real-time video action understanding for applications requiring immediate processing, though it appears incremental as it builds on existing methods.

The thesis tackles online spatiotemporal action detection and prediction by converting the offline detection pipeline into an online, real-time system and by extending action tubes into the future through regression. It then seeks to show that online/causal representations can perform comparably to offline 3D CNNs on action recognition, temporal action segmentation, and early prediction.

In this thesis, we focus on video action understanding problems from an online and real-time processing point of view. We start with the conversion of the traditional offline spatiotemporal action detection pipeline into an online spatiotemporal action tube detection system. An action tube is a set of bounding boxes connected over time, which bounds an action instance in space and time. Next, we explore the future prediction capabilities of such detection methods by extending an existing action tube into the future by regression. Later, we seek to establish that online/causal representations can achieve similar performance to that of offline three-dimensional (3D) convolutional neural networks (CNNs) on various tasks, including action recognition, temporal action segmentation, and early prediction.
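To make the notion of an action tube concrete, the following is a minimal Python sketch of how per-frame boxes could be linked online and then extended into the future. It is not the thesis's method: the greedy IoU matching, the constant-velocity extrapolation (a stand-in for the learned regression the abstract mentions), and all names (`Tube`, `link_frame`, `extrapolate`, `iou_threshold`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Tube:
    # Each entry is (frame_index, box): boxes connected over time.
    boxes: List[Tuple[int, Box]] = field(default_factory=list)


def link_frame(tubes: List[Tube], detections: List[Box],
               frame_index: int, iou_threshold: float = 0.5) -> List[Tube]:
    """Causal update: uses only the current frame's detections.

    Each existing tube greedily takes the unused detection that best
    overlaps its latest box; leftover detections start new tubes.
    """
    unused = list(detections)
    for tube in tubes:
        if not unused:
            break
        last_box = tube.boxes[-1][1]
        best = max(unused, key=lambda d: iou(last_box, d))
        if iou(last_box, best) >= iou_threshold:
            tube.boxes.append((frame_index, best))
            unused.remove(best)
    for det in unused:
        tubes.append(Tube(boxes=[(frame_index, det)]))
    return tubes


def extrapolate(tube: Tube, steps: int = 1) -> Box:
    """Extend a tube `steps` frames ahead with a constant-velocity guess.

    Assumption: a simple stand-in for a learned future regressor.
    """
    if len(tube.boxes) < 2:
        return tube.boxes[-1][1]
    (f0, b0), (f1, b1) = tube.boxes[-2], tube.boxes[-1]
    dt = f1 - f0
    return tuple(c1 + (c1 - c0) / dt * steps for c0, c1 in zip(b0, b1))
```

Calling `link_frame` once per incoming frame keeps the procedure online, since no future frames are consulted, and `extrapolate` then gives a rough estimate of where a tube's box would sit a few frames ahead.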

Code Implementations: 1 repo