CV AI LG May 13, 2021

Using Self-Supervised Auxiliary Tasks to Improve Fine-Grained Facial Representation

Mahdi Pourmirzaei, Gholam Ali Montazer, Farzaneh Esmaili

arXiv:2105.06421v4 35 citations

Originality: Incremental advance

AI Analysis

The paper addresses fine-grained facial analysis problems such as emotion recognition, offering a simple method that improves accuracy without adding inference-time cost.

The paper tackles facial emotion recognition by first showing that, on AffectNet, training from scratch with strong augmentation matches or surpasses ImageNet fine-tuning. It then proposes Hybrid Multi-Task Learning (HMTL), which adds self-supervised auxiliary tasks during training only. On AffectNet, HMTL achieves state-of-the-art accuracy for eight emotions without extra pretraining data, with larger gains in low-data settings.

Facial emotion recognition (FER) is a fine-grained problem where the value of transfer learning is often assumed. We first quantify this assumption and show that, on AffectNet, training from random initialization with sufficiently strong augmentation consistently matches or surpasses fine-tuning from ImageNet. Motivated by this result, we propose Hybrid Multi-Task Learning (HMTL) for FER in the wild. HMTL augments supervised learning (SL) with self-supervised learning (SSL) objectives during training, while keeping the inference-time model unchanged. We instantiate HMTL with two tailored pretext tasks, puzzling and inpainting with a perceptual loss, that encourage part-aware and expression-relevant features. On AffectNet, both HMTL variants achieve state-of-the-art accuracy in the eight-emotion setting without any additional pretraining data, and they provide larger gains under low-data regimes. Compared with conventional SSL pretraining, HMTL yields stronger downstream performance. Beyond FER, the same strategy improves fine-grained facial analysis tasks, including head pose estimation and gender recognition. These results suggest that aligned SSL auxiliaries are an effective and simple way to strengthen supervised fine-grained facial representation without adding extra computation cost during inference time.
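As a rough illustration of the idea in the abstract, the sketch below shows one way a puzzling pretext input could be built and how a supervised loss might be combined with auxiliary self-supervised losses during training. It is not the authors' implementation: the function names `make_puzzle` and `hybrid_loss`, the 2x2 grid, the weighted-sum form of the objective, and the weight values are all assumptions made for this example.

```python
import random


def make_puzzle(image, grid=2, rng=None):
    """Cut a square image (a list of rows) into grid x grid tiles and shuffle them.

    Returns (shuffled_image, perm), where perm[i] is the index of the original
    tile (row-major order) that now sits at position i. The permutation can
    serve as the target of a hypothetical puzzle-solving auxiliary head.
    """
    rng = rng or random.Random()
    size = len(image)
    assert size % grid == 0, "image side must be divisible by the grid size"
    tile = size // grid

    # Cut the image into tiles in row-major order.
    tiles = []
    for r in range(grid):
        for c in range(grid):
            rows = image[r * tile:(r + 1) * tile]
            tiles.append([row[c * tile:(c + 1) * tile] for row in rows])

    # Draw a random arrangement of the tiles.
    perm = list(range(grid * grid))
    rng.shuffle(perm)

    # Reassemble the image from the shuffled tiles.
    shuffled = []
    for r in range(grid):
        for y in range(tile):
            row = []
            for c in range(grid):
                row.extend(tiles[perm[r * grid + c]][y])
            shuffled.append(row)
    return shuffled, perm


def hybrid_loss(supervised_loss, auxiliary_losses, weights):
    """Weighted sum of the supervised loss and auxiliary SSL losses.

    The weighted-sum form and the weights are assumptions for illustration.
    Only the supervised branch would be kept at inference time.
    """
    assert len(auxiliary_losses) == len(weights)
    return supervised_loss + sum(w * l for w, l in zip(weights, auxiliary_losses))


if __name__ == "__main__":
    # A toy 4x4 "image" whose pixel values are 0..15.
    image = [[4 * r + c for c in range(4)] for r in range(4)]
    shuffled, perm = make_puzzle(image, grid=2, rng=random.Random(0))
    print("tile order:", perm)
    for row in shuffled:
        print(row)
    # Emotion loss plus two hypothetical auxiliary losses (puzzle, inpainting).
    print("total loss:", hybrid_loss(1.0, [0.5, 2.0], [0.2, 0.1]))
```

In a real training loop the auxiliary heads would share the backbone with the emotion classifier and be discarded after training, which is why the inference-time model stays unchanged.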
