Robustness Evaluation for Video Models with Reinforcement Learning
This work addresses robustness evaluation for video models, which is critical for applications such as action recognition. The contribution is incremental in that it builds on existing adversarial attack methods, although the multi-agent formulation is new.
The paper tackles robustness evaluation for video classification models by proposing a multi-agent reinforcement learning approach that identifies sensitive spatial and temporal regions and generates fine, visually imperceptible perturbations. The method outperforms state-of-the-art solutions on the Lp metric and the average number of queries.
Evaluating the robustness of video classification models is considerably more challenging than evaluating image-based models. The added temporal dimension significantly increases complexity and computational cost. One of the key challenges is to keep the perturbations needed to induce misclassification to a minimum. In this work, we propose a multi-agent reinforcement learning approach with spatial and temporal agents that cooperatively learn to identify the sensitive spatial and temporal regions of a given video. The agents take temporal coherence into account when generating fine perturbations, leading to a more effective and visually imperceptible attack. Our method outperforms state-of-the-art solutions on the Lp metric and the average number of queries. It also supports custom distortion types, making the robustness evaluation more relevant to the use case. We extensively evaluate four popular video action recognition models on two popular datasets, HMDB-51 and UCF-101.
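To make the two reported quantities concrete, the following is a minimal sketch of how a perturbation confined to a spatial patch in selected frames could be applied, and how its Lp distance and the number of model queries could be measured. It is not the paper's implementation: the function names (`apply_region_perturbation`, `lp_distance`), the `QueryCounter` wrapper, the square patch shape, and the (T, H, W, C) layout with pixel values in [0, 1] are all assumptions made for illustration, and no reinforcement learning agent is included.

```python
import numpy as np


def apply_region_perturbation(video, frame_idx, top, left, size, delta):
    """Add `delta` to one square patch in the selected frames.

    Assumes `video` has shape (T, H, W, C) with values in [0, 1] and that
    `frame_idx` contains unique frame indices (hypothetical interface).
    """
    perturbed = video.copy()
    perturbed[frame_idx, top:top + size, left:left + size, :] += delta
    # Keep the result a valid video by clipping to the pixel range.
    return np.clip(perturbed, 0.0, 1.0)


def lp_distance(original, perturbed, p=2):
    """Lp norm of the flattened perturbation (p may be 0, 1, 2, np.inf, ...)."""
    diff = (perturbed - original).ravel()
    return float(np.linalg.norm(diff, ord=p))


class QueryCounter:
    """Wrap a black-box classifier and count how often it is queried."""

    def __init__(self, model_fn):
        self.model_fn = model_fn  # any callable mapping a video to a label
        self.queries = 0

    def __call__(self, video):
        self.queries += 1
        return self.model_fn(video)
```

In such a setup, an attack loop would call the wrapped model once per candidate perturbation, stop when the predicted label changes, and report `lp_distance` together with `queries`; averaging the latter over a test set gives the average number of queries.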