Dynamic Adaptive Threshold based Learning for Noisy Annotations Robust Facial Expression Recognition
This work addresses noisy annotations in facial expression recognition, which is a domain-specific problem, and presents an incremental improvement by combining supervised and unsupervised training techniques.
The authors tackled the problem of noisy annotations in facial expression recognition datasets by proposing a dynamic learning framework that selects clean samples using class-specific thresholds and uses unsupervised consistency training. Their method achieved robust performance on both synthetic and real noisy datasets, including RAFDB, FERPlus, SFEW, and AffectNet, without requiring prior knowledge of noise rates or clean data.
Real-world facial expression recognition (FER) datasets suffer from noisy annotations due to crowd-sourcing, ambiguity in expressions, the subjectivity of annotators, and inter-class similarity. Moreover, recent deep networks have a strong capacity to memorize noisy annotations, leading to corrupted feature embeddings and poor generalization. To handle noisy annotations, we propose a dynamic FER learning framework (DNFER) in which clean samples are selected based on a dynamic class-specific threshold during training. Specifically, DNFER combines supervised training on the selected clean samples with unsupervised consistency training on all samples. During training, the mean posterior class probabilities of each mini-batch are used as the dynamic class-specific thresholds to select clean samples for supervised training. Unlike other methods, this threshold is independent of the noise rate and does not need any clean data. In addition, to learn from all samples, the posterior distributions of the weakly augmented and strongly augmented versions of each image are aligned using an unsupervised consistency loss. We demonstrate the robustness of DNFER on both synthetic and real noisy annotated FER datasets such as RAFDB, FERPlus, SFEW, and AffectNet.
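The two training signals described in the abstract can be illustrated with a minimal sketch in plain Python. It rests on assumptions the abstract does not spell out: a sample is treated as clean when the predicted probability of its annotated label is at least the mini-batch mean for that class, and the consistency loss is taken to be a KL divergence between the weak-view and strong-view posteriors. The function names (`class_thresholds`, `select_clean`, `consistency_loss`) and the toy batch are hypothetical and are not taken from the authors' code.

```python
import math


def class_thresholds(probs):
    """Per-class threshold: mean posterior probability over the mini-batch."""
    n = len(probs)
    num_classes = len(probs[0])
    return [sum(row[c] for row in probs) / n for c in range(num_classes)]


def select_clean(probs, labels):
    """Return indices of samples treated as clean.

    Assumed rule: keep a sample when the probability assigned to its
    annotated label is at least the batch-mean threshold of that class.
    """
    thr = class_thresholds(probs)
    return [i for i, (row, y) in enumerate(zip(probs, labels)) if row[y] >= thr[y]]


def consistency_loss(weak, strong, eps=1e-12):
    """Mean KL(weak || strong) over the batch.

    KL divergence is an assumed choice; the abstract only states that the
    two posterior distributions are aligned.
    """
    total = 0.0
    for p, q in zip(weak, strong):
        total += sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return total / len(weak)


# Toy mini-batch: 4 samples, 3 classes (posteriors from the weak view).
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],  # annotated as class 1, but the model favours class 0
    [0.2, 0.2, 0.6],
]
labels = [0, 1, 1, 2]

clean_idx = select_clean(probs, labels)  # feeds the supervised loss
```

In this toy batch the class thresholds are roughly 0.4, 0.375, and 0.225, so the third sample (probability 0.3 on its annotated class 1) falls below its threshold and is excluded from the supervised loss. It would still contribute to the consistency term, which is computed over all samples. Because the thresholds are recomputed from each mini-batch, no noise-rate estimate or clean validation set enters the selection.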