Spatio-Temporal Action Graph Networks
This work addresses activity recognition in scenarios such as driving, where critical events are rare and object interactions are key. It represents those interactions explicitly, which has the potential to learn more efficiently than models built on global scene descriptors.
The paper tackles the problem of recognizing events that involve object interactions when few labeled examples are available. It proposes a novel inter-object graph representation with disentangled spatial and temporal embeddings. The model significantly outperforms baseline approaches on the Charades benchmark and on a new driving dataset with near-collision events.
Events defined by the interaction of objects in a scene are often of critical importance, yet important events may have insufficient labeled examples to train a conventional deep model to generalize to future object appearance. Activity recognition models that represent object interactions explicitly have the potential to learn in a more efficient manner than those that represent scenes with global descriptors. We propose a novel inter-object graph representation for activity recognition based on a disentangled graph embedding with direct observation of edge appearance. We employ a novel factored embedding of the graph structure, disentangling a representation hierarchy formed over spatial dimensions from that found over temporal variation. We demonstrate the effectiveness of our model on the Charades activity recognition benchmark, as well as on a new dataset of driving activities focusing on multi-object interactions with near-collision events. Our model offers significantly improved performance compared both to baseline approaches without object-graph representations and to previous graph-based models.
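As a rough illustration of what a factored, spatial-then-temporal graph embedding could look like, the NumPy sketch below first pools over object pairs within each frame and only then pools over frames. It is not the authors' architecture: the function name, the weight shapes, the single linear layer per stage, mean pooling over pairs, and max pooling over time are all assumptions chosen for brevity. The `edges` array stands in for directly observed edge appearance features (for example, features extracted from the region covering both objects), which is also an assumption about how such features might be obtained.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def factored_graph_embedding(nodes, edges, w_spatial, w_temporal):
    """Hypothetical two-stage embedding of a spatio-temporal object graph.

    nodes:      (T, N, D)      per-frame object (node) features
    edges:      (T, N, N, De)  per-frame pairwise (edge) appearance features
    w_spatial:  (2*D + De, H)  weights of the spatial stage (assumed shape)
    w_temporal: (H, K)         weights of the temporal stage (assumed shape)
    Returns the per-frame embeddings (T, H) and the clip embedding (K,).
    """
    T, N, D = nodes.shape
    # Pair feature for every ordered object pair (i, j): [x_i, x_j, e_ij].
    xi = np.broadcast_to(nodes[:, :, None, :], (T, N, N, D))
    xj = np.broadcast_to(nodes[:, None, :, :], (T, N, N, D))
    pairs = np.concatenate([xi, xj, edges], axis=-1)  # (T, N, N, 2D + De)

    # Spatial stage: embed each pair, then average over all pairs in a frame.
    per_frame = relu(pairs @ w_spatial).mean(axis=(1, 2))  # (T, H)

    # Temporal stage: embed each frame summary, then max-pool over time.
    clip = relu(per_frame @ w_temporal).max(axis=0)  # (K,)
    return per_frame, clip


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N, D, De, H, K = 4, 3, 5, 2, 8, 6  # toy sizes, chosen arbitrarily
    nodes = rng.normal(size=(T, N, D))
    edges = rng.normal(size=(T, N, N, De))
    w_s = rng.normal(size=(2 * D + De, H))
    w_t = rng.normal(size=(H, K))
    per_frame, clip = factored_graph_embedding(nodes, edges, w_s, w_t)
    print(per_frame.shape, clip.shape)
```

Because the spatial stage averages over all object pairs before any temporal operation is applied, the per-frame embedding in this sketch does not depend on the order in which objects are listed. Keeping the two pooling steps separate is one simple way to keep spatial structure and temporal variation in distinct stages of the hierarchy.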