Truncate-Split-Contrast: A Framework for Learning from Mislabeled Videos
This work addresses the understudied issue of noisy labels in video classification, offering a domain-specific solution that improves performance on benchmarks such as Mini-Kinetics and Sth-Sth-V1.
The paper tackles the problem of learning from mislabeled videos by proposing a framework that includes channel truncation for noise detection and noise contrastive learning for regularization, achieving over 0.4 F1-score in noise detection and up to 5% accuracy improvement on benchmark datasets under severe noise conditions.
Learning with noisy labels (LNL) is a classic problem that has been studied extensively for image tasks but far less for video. Directly migrating image-based methods to video without accounting for video-specific properties, such as high computational cost and redundant information, is not a sound choice. In this paper, we propose two new strategies for video analysis with noisy labels: 1) a lightweight channel selection method, dubbed Channel Truncation, for feature-based label noise detection, which selects the most discriminative channels to split clean and noisy instances in each category; and 2) a novel contrastive strategy, dubbed Noise Contrastive Learning, which constructs relationships between clean and noisy instances to regularize model training. Experiments on three well-known benchmark datasets for video classification show that our proposed tru{\bf N}cat{\bf E}-split-contr{\bf A}s{\bf T} (NEAT) significantly outperforms existing baselines. By reducing the feature dimension to 10\% of its original size, our method achieves over 0.4 noise detection F1-score and 5\% classification accuracy improvement on the Mini-Kinetics dataset under severe noise (symmetric-80\%). Thanks to Noise Contrastive Learning, the average classification accuracy improvement on Mini-Kinetics and Sth-Sth-V1 is over 1.6\%.
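The truncate-then-split idea can be illustrated with a minimal sketch. The abstract does not state how channels are ranked or how instances are split, so everything below is an assumption made for illustration rather than the authors' implementation: channels are ranked by their mean activation within one labeled category, the top 10\% are kept, and instances are split by a one-dimensional two-means clustering of their projection onto the category prototype. The function names `truncate_channels`, `two_means_1d`, and `split_clean_noisy` are hypothetical.

```python
import numpy as np


def truncate_channels(feats, keep_ratio=0.1):
    """Keep the top `keep_ratio` fraction of channels for one category.

    Assumption: channels are ranked by mean activation over the category;
    the source does not specify the actual selection criterion.
    """
    k = max(1, int(round(feats.shape[1] * keep_ratio)))
    idx = np.argsort(feats.mean(axis=0))[-k:]
    return feats[:, idx], idx


def two_means_1d(scores, iters=20):
    """Split 1-D scores into two groups; True marks the higher-score group."""
    lo, hi = scores.min(), scores.max()
    high = np.ones(scores.shape, dtype=bool)
    for _ in range(iters):
        high = np.abs(scores - hi) <= np.abs(scores - lo)
        if high.all() or not high.any():
            break  # degenerate case: everything fell into one group
        lo, hi = scores[~high].mean(), scores[high].mean()
    return high


def split_clean_noisy(feats, keep_ratio=0.1):
    """Return a boolean mask (True = presumed clean) for one category.

    Assumption: instances whose truncated features project strongly onto
    the category prototype are treated as clean.
    """
    trunc, _ = truncate_channels(feats, keep_ratio)
    proto = trunc.mean(axis=0)
    proto = proto / (np.linalg.norm(proto) + 1e-12)
    return two_means_1d(trunc @ proto)
```

Here `feats` is assumed to be an array of shape (instances, channels) holding non-negative features of all videos that carry the same, possibly wrong, label. The split operates only on the retained channels, so its cost scales with the truncated dimension rather than the full one.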