CV | AI | Aug 22, 2022

Identifying Auxiliary or Adversarial Tasks Using Necessary Condition Analysis for Adversarial Multi-task Video Understanding

Stephen Su, Samuel Kwong, Qingyu Zhao, De-An Huang, Juan Carlos Niebles, Ehsan Adeli

Salesforce, Stanford

arXiv:2208.10077v1 | 1.4 | h-index: 64

Originality: Incremental advance

AI Analysis

This work challenges the assumption that a multi-task video model should be optimized to do well on every task. The advance is incremental, since it builds on existing multi-task methods.

The paper distinguishes auxiliary tasks, which help the main task, from adversarial tasks, which harm it, in multi-task video learning. It uses Necessary Condition Analysis to identify the adversarial tasks (scene recognition in the HVU dataset) and shows that penalizing them improves action recognition accuracy by about 3% on challenging test splits.

There has been an increasing interest in multi-task learning for video understanding in recent years. In this work, we propose a generalized notion of multi-task learning by incorporating both auxiliary tasks that the model should perform well on and adversarial tasks that the model should not perform well on. We employ Necessary Condition Analysis (NCA) as a data-driven approach for deciding what category these tasks should fall in. Our novel proposed framework, Adversarial Multi-Task Neural Networks (AMT), penalizes adversarial tasks, determined by NCA to be scene recognition in the Holistic Video Understanding (HVU) dataset, to improve action recognition. This upends the common assumption that the model should always be encouraged to do well on all tasks in multi-task learning. Simultaneously, AMT still retains all the benefits of multi-task learning as a generalization of existing methods and uses object recognition as an auxiliary task to aid action recognition. We introduce two challenging Scene-Invariant test splits of HVU, where the model is evaluated on action-scene co-occurrences not encountered in training. We show that our approach improves accuracy by ~3% and encourages the model to attend to action features instead of correlation-biasing scene features.
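The idea of rewarding auxiliary tasks while penalizing adversarial ones can be illustrated with a minimal sketch. The abstract does not say how AMT implements the penalty, so the code below simply assumes a sign-flipped, weighted loss term; the function names, the single-label cross-entropy, and the weights `aux_weight` and `adv_weight` are all hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class given class probabilities."""
    return -math.log(probs[target])


def amt_style_loss(action_probs, object_probs, scene_probs,
                   action_y, object_y, scene_y,
                   aux_weight=1.0, adv_weight=0.1):
    """Combine main, auxiliary, and adversarial task losses.

    Assumption: the adversarial task is penalized by subtracting its loss,
    so a model that predicts scenes well receives a higher total loss.
    The weights are illustrative placeholders.
    """
    main = cross_entropy(action_probs, action_y)   # action recognition (main task)
    aux = cross_entropy(object_probs, object_y)    # object recognition (auxiliary)
    adv = cross_entropy(scene_probs, scene_y)      # scene recognition (adversarial)
    return main + aux_weight * aux - adv_weight * adv


# Same action and object predictions, different scene confidence.
scene_confident = amt_style_loss([0.5, 0.5], [0.5, 0.5], [0.9, 0.1], 0, 0, 0)
scene_uncertain = amt_style_loss([0.5, 0.5], [0.5, 0.5], [0.5, 0.5], 0, 0, 0)
print(scene_confident > scene_uncertain)
```

With identical action and object predictions, the variant that recognizes the scene confidently ends up with the larger total loss, which is the behavior the abstract describes for adversarial tasks. A subtracted loss term is unbounded below, so a practical implementation would likely need a small weight or a different mechanism; that choice is left open here.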
