CV · May 23, 2021

Coarse to Fine Multi-Resolution Temporal Convolutional Network

arXiv:2105.10859v165 citations
Originality Highly original
AI Analysis

This paper addresses over-segmentation errors in temporal convolutional networks for video analysis, aiming for more accurate and temporally coherent segmentations in applications such as action recognition.

The paper tackles the problem of sequence fragmentation in temporal video segmentation by proposing a coarse-to-fine temporal encoder-decoder with implicit multi-resolution ensembling, achieving state-of-the-art performance on three benchmarks without needing additional refinement modules.

Temporal convolutional networks (TCNs) are a commonly used architecture for temporal video segmentation. TCNs, however, tend to suffer from over-segmentation errors and require additional refinement modules to ensure smoothness and temporal coherency. In this work, we propose a novel temporal encoder-decoder to tackle the problem of sequence fragmentation. In particular, the decoder follows a coarse-to-fine structure with an implicit ensemble of multiple temporal resolutions. The ensembling produces smoother segmentations that are more accurate and better-calibrated, bypassing the need for additional refinement modules. In addition, we enhance our training with a multi-resolution feature-augmentation strategy to promote robustness to varying temporal resolutions. Finally, to support our architecture and encourage further sequence coherency, we propose an action loss that penalizes misclassifications at the video level. Experiments show that our stand-alone architecture, together with our novel feature-augmentation strategy and new loss, outperforms the state-of-the-art on three temporal video segmentation benchmarks.
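The abstract does not spell out how the multi-resolution ensemble or the video-level action loss are computed, so the following is only a minimal NumPy sketch of the two ideas under stated assumptions. It assumes that each decoder stage emits per-frame class probabilities at its own temporal resolution, that coarser outputs are upsampled by nearest-neighbor repetition and combined by a weighted average, and that the video-level loss is a binary cross-entropy on frame logits max-pooled over time. The function names (`upsample_nearest`, `ensemble_resolutions`, `video_level_action_loss`) and the toy numbers are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def upsample_nearest(probs, target_len):
    """Stretch a (T_src, C) array to (target_len, C) by repeating rows."""
    src_len = probs.shape[0]
    # Map every target frame to the coarse frame that covers it.
    idx = np.arange(target_len) * src_len // target_len
    return probs[idx]


def ensemble_resolutions(prob_list, weights=None):
    """Weighted average of per-frame probabilities from several resolutions.

    Assumption: all entries describe the same video, only at different
    temporal resolutions; the finest one sets the output length.
    """
    target_len = max(p.shape[0] for p in prob_list)
    if weights is None:
        weights = np.ones(len(prob_list))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so rows still sum to 1
    stacked = np.stack([upsample_nearest(p, target_len) for p in prob_list])
    # Contract the resolution axis: (R,) x (R, T, C) -> (T, C)
    return np.tensordot(weights, stacked, axes=1)


def video_level_action_loss(frame_logits, present_actions):
    """Binary cross-entropy on logits max-pooled over time (assumed form).

    frame_logits: (T, C) array of per-frame class scores.
    present_actions: iterable of class indices that occur in the video.
    """
    video_logits = frame_logits.max(axis=0)          # (C,)
    probs = 1.0 / (1.0 + np.exp(-video_logits))      # sigmoid per class
    target = np.zeros_like(probs)
    target[sorted(present_actions)] = 1.0
    eps = 1e-12                                      # avoid log(0)
    bce = -(target * np.log(probs + eps)
            + (1.0 - target) * np.log(1.0 - probs + eps))
    return float(bce.mean())


# Toy example: the fine prediction flips to class 1 for a single frame,
# while the coarse prediction is consistently class 0.
fine = np.array([[0.9, 0.1], [0.4, 0.6], [0.9, 0.1], [0.9, 0.1]])
coarse = np.array([[0.8, 0.2], [0.9, 0.1]])
fused = ensemble_resolutions([fine, coarse])

fine_labels = fine.argmax(axis=1)
fused_labels = fused.argmax(axis=1)
```

In this toy case the one-frame flip in the fine-resolution prediction is outvoted by the coarse prediction, which illustrates why averaging across resolutions can reduce fragmentation. The video-level loss adds a complementary signal: it grows when the network assigns a high score anywhere in the sequence to an action that never occurs in the video.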

Code Implementations: 1 repo
