Failures Are Fated, But Can Be Faded: Characterizing and Mitigating Unwanted Behaviors in Large-Scale Vision and Language Models
The work addresses the need for engineers to debug models, and for legislative bodies to audit them, before deployment. Its contribution is incremental, since it builds on existing post-hoc analysis techniques.
The paper tackles the problem of characterizing and mitigating failure modes, such as accuracy issues and social biases, in large-scale vision and language models. It introduces a post-hoc deep reinforcement learning method that explores the failure landscape and then restructures it with limited human feedback, and it reports effectiveness across multiple tasks.
Large deep neural networks perform surprisingly well on many tasks, yet we still observe a few failures related to accuracy, social biases, and alignment with human values, among others. Before deploying these models, it is therefore crucial to characterize this failure landscape so that engineers can debug the models and legislative bodies can audit them. However, it is infeasible to exhaustively test all possible combinations of factors that could lead to a model's failure. In this paper, we introduce a post-hoc method that utilizes \emph{deep reinforcement learning} to explore and construct the landscape of failure modes in pre-trained discriminative and generative models. With the aid of limited human feedback, we then demonstrate how to restructure the failure landscape to be more desirable by moving away from the discovered failure modes. We empirically show the effectiveness of the proposed method across common Computer Vision, Natural Language Processing, and Vision-Language tasks.
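To make the exploration idea concrete, the following is a minimal sketch and not the paper's implementation. It swaps deep reinforcement learning for plain tabular Q-learning and replaces the pre-trained model with a hypothetical stand-in, `model_fails`, that fails in one corner of a 5x5 grid of input transformations. The grid, the action set, the reward of 1.0 on failure, and all hyperparameters are assumptions chosen for illustration. The agent is rewarded whenever it reaches a transformation on which the model fails, and the failures it records along the way form a crude failure landscape.

```python
import random

# Hypothetical toy setup (not from the paper): states are cells in a 5x5 grid
# of input transformations, e.g. (rotation bin, brightness bin).
GRID = 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # move to a neighbouring cell


def model_fails(state):
    """Stand-in for querying a pre-trained model; fails in one corner region."""
    r, b = state
    return r >= 3 and b >= 3


def step(state, action):
    """Apply an action, clamp to the grid, and reward 1.0 if the model fails."""
    r = min(max(state[0] + action[0], 0), GRID - 1)
    b = min(max(state[1] + action[1], 0), GRID - 1)
    nxt = (r, b)
    return nxt, (1.0 if model_fails(nxt) else 0.0)


def explore(episodes=500, horizon=30, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    """Tabular Q-learning (a simple substitute for deep RL) over the grid."""
    rng = random.Random(seed)
    n_actions = len(ACTIONS)
    q = {((r, b), a): 0.0
         for r in range(GRID) for b in range(GRID) for a in range(n_actions)}
    failures = {}  # state -> number of times a failure was observed there
    for _ in range(episodes):
        state = (0, 0)  # start from the untransformed input
        for _ in range(horizon):
            if rng.random() < eps:
                a = rng.randrange(n_actions)  # explore
            else:
                # Exploit, breaking ties at random so the agent does not stall.
                best = max(q[(state, i)] for i in range(n_actions))
                a = rng.choice([i for i in range(n_actions)
                                if q[(state, i)] == best])
            nxt, reward = step(state, ACTIONS[a])
            best_next = max(q[(nxt, i)] for i in range(n_actions))
            q[(state, a)] += alpha * (reward + gamma * best_next - q[(state, a)])
            if reward > 0:
                failures[nxt] = failures.get(nxt, 0) + 1
            state = nxt
    return q, failures


if __name__ == "__main__":
    _, failure_counts = explore()
    # Print the discovered failure landscape as a grid of observation counts.
    for r in range(GRID):
        print(" ".join(f"{failure_counts.get((r, b), 0):5d}" for b in range(GRID)))
```

In this sketch the reward steers the agent toward failure regions, so failures are found without enumerating every combination of factors. The mitigation step with human feedback is not sketched here, because the abstract does not specify how it is carried out.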