Debiasing Convolutional Neural Networks via Meta Orthogonalization
This work addresses fairness in deep learning for applications where models may rely on sensitive attributes, though it appears incremental because it builds on existing debiasing techniques.
The paper tackles the problem of debiasing convolutional neural networks to reduce their reliance on spurious correlations such as protected attributes, and demonstrates that its Meta Orthogonalization method significantly mitigates bias while maintaining competitive task performance.
While deep learning models often achieve strong task performance, their successes are hampered by their inability to disentangle spurious correlations from causative factors, such as when they use protected attributes (e.g., race or gender) to make decisions. In this work, we tackle the problem of debiasing convolutional neural networks (CNNs) in such instances. Building on existing work on debiasing word embeddings and model interpretability, our Meta Orthogonalization method encourages the CNN representations of different concepts (e.g., gender and class labels) to be orthogonal to one another in activation space while maintaining strong downstream task performance. Through a variety of experiments, we systematically test our method and demonstrate that it significantly mitigates model bias and is competitive with current adversarial debiasing methods.
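The orthogonality idea in the abstract can be made concrete with a small numerical sketch. The abstract does not give the paper's loss, so everything here is an assumption for illustration: the function name `orthogonality_penalty`, the toy vectors, and the choice of mean squared cosine similarity as the penalty are hypothetical. The sketch assumes each concept (a protected attribute or a class label) is summarized by one direction vector in activation space.

```python
import numpy as np

def orthogonality_penalty(bias_vec, class_vecs):
    """Mean squared cosine similarity between a bias direction and class directions.

    Hypothetical penalty, not the paper's stated loss: it is 0 when every
    class direction is orthogonal to the bias direction and 1 when every
    class direction is parallel to it.
    """
    # Normalize so the penalty depends on direction only, not magnitude.
    bias_unit = bias_vec / np.linalg.norm(bias_vec)
    class_units = class_vecs / np.linalg.norm(class_vecs, axis=1, keepdims=True)
    # One cosine similarity per class direction.
    cosines = class_units @ bias_unit
    return float(np.mean(cosines ** 2))

# Toy 3-dimensional activation space (made-up values).
gender_direction = np.array([1.0, 0.0, 0.0])
class_directions = np.array([
    [0.0, 2.0, 0.0],  # orthogonal to the bias direction
    [3.0, 0.0, 0.0],  # parallel to the bias direction
])

penalty = orthogonality_penalty(gender_direction, class_directions)
orthogonal_only = orthogonality_penalty(gender_direction, class_directions[:1])
```

In a training setup, a term of this kind would presumably be added to the task loss so that the network is pushed toward representations in which class directions carry little of the protected-attribute direction. How the concept directions are obtained and how the terms are weighted is not specified in the abstract.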