SemanticFace: Semantic Facial Action Estimation via Semantic Distillation in Interpretable Space
This work addresses the need for interpretable facial action estimation in applications such as avatar control and human-computer interaction; it represents an incremental advance.
The paper estimates interpretable facial actions from single images by predicting ARKit blendshape coefficients, and reports improved accuracy and perceptual consistency through semantic distillation.
Facial action estimation from a single image is often formulated as predicting or fitting parameters in compact expression spaces, which lack explicit semantic interpretability. However, many practical applications, such as avatar control and human-computer interaction, require interpretable facial actions that correspond to meaningful muscle movements. In this work, we propose \textbf{SemanticFace}, a framework for facial action estimation in the interpretable ARKit blendshape space that reformulates coefficient prediction as structured semantic reasoning. SemanticFace adopts a two-stage semantic distillation paradigm: it first derives structured semantic supervision from ground-truth ARKit coefficients and then distills this knowledge into a multimodal large language model to predict interpretable facial action coefficients from images. Extensive experiments demonstrate that language-aligned semantic supervision improves both coefficient accuracy and perceptual consistency, while enabling strong cross-identity generalization and robustness to large domain shifts, including cartoon faces.
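To make the first stage concrete, the following is a minimal sketch of how ground-truth ARKit coefficients might be turned into structured, language-like supervision. It is not the authors' implementation: the abstract does not specify the format of the semantic supervision, so the intensity bins, the labels, and the helper names `intensity_label` and `describe_coefficients` are all assumptions made for illustration. Only the blendshape names (`jawOpen`, `mouthSmileLeft`, `eyeBlinkLeft`) and the [0, 1] coefficient range come from ARKit itself.

```python
from typing import Dict, List

# Hypothetical intensity bins (upper bound, label). The actual scheme used by
# SemanticFace is not described in the abstract; these values are assumptions.
INTENSITY_BINS = [(0.15, "none"), (0.40, "slight"), (0.70, "moderate"), (1.01, "strong")]


def intensity_label(value: float) -> str:
    """Map a blendshape coefficient to a coarse intensity word."""
    clipped = min(max(value, 0.0), 1.0)  # ARKit coefficients lie in [0, 1]
    for upper, label in INTENSITY_BINS:
        if clipped < upper:
            return label
    return "strong"


def describe_coefficients(coeffs: Dict[str, float]) -> List[str]:
    """Build one text line per active blendshape, sorted by name.

    Blendshapes whose intensity falls in the "none" bin are omitted, so the
    description only mentions facial actions that are actually present.
    """
    lines = []
    for name in sorted(coeffs):
        label = intensity_label(coeffs[name])
        if label != "none":
            lines.append(f"{name}: {label} ({coeffs[name]:.2f})")
    return lines


if __name__ == "__main__":
    example = {"jawOpen": 0.82, "mouthSmileLeft": 0.35, "eyeBlinkLeft": 0.05}
    for line in describe_coefficients(example):
        print(line)
```

Under these assumptions, text of this kind could serve as a training target for the multimodal language model in the second stage, with the numeric value kept alongside the intensity word so that coefficients remain recoverable from the output.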