CV, AI, LG | Mar 13, 2025

A Multimodal Fusion Model Leveraging MLP Mixer and Handcrafted Features-based Deep Learning Networks for Facial Palsy Detection

arXiv:2503.10371v1 | 1 citation | h-index: 1 | PAKDD
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of automating labor-intensive and subjective clinical assessments for facial palsy detection, representing an incremental improvement with domain-specific application.

The paper tackled facial palsy detection by developing a multimodal fusion model combining MLP mixer and feed-forward neural networks, achieving a 96.00 F1 score, which outperformed single-modality baselines.

Algorithmic detection of facial palsy offers the potential to improve current practices, which usually involve labor-intensive and subjective assessments by clinicians. In this paper, we present a multimodal fusion-based deep learning model that uses an MLP mixer-based model to process unstructured data (i.e., RGB images or images with facial line segments) and a feed-forward neural network to process structured data (i.e., facial landmark coordinates, features of facial expressions, or handcrafted features) for detecting facial palsy. We then contribute a study that analyzes the effect of different data modalities and the benefits of a multimodal fusion-based approach, using videos of 20 facial palsy patients and 20 healthy subjects. Our multimodal fusion model achieved 96.00 F1, which is significantly higher than the feed-forward neural network trained on handcrafted features alone (82.80 F1) and the MLP mixer-based model trained on raw RGB images (89.00 F1).
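
For readers unfamiliar with the metric, the F1 figures above are the harmonic mean of precision and recall, reported on a 0 to 100 scale. The following is a minimal sketch of that calculation for a binary palsy-versus-healthy task. The function name `f1_score_percent`, the label encoding (1 = facial palsy, 0 = healthy), and the toy labels are illustrative assumptions; the abstract does not state whether the paper scores per frame or per subject, or which averaging scheme it uses.

```python
def f1_score_percent(y_true, y_pred, positive=1):
    """Binary F1 on a 0-100 scale (hypothetical helper, not from the paper)."""
    pairs = list(zip(y_true, y_pred))
    # Count outcomes with respect to the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined,
        # so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy labels (assumed encoding: 1 = facial palsy, 0 = healthy).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
score = f1_score_percent(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

In this toy case there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and the script prints `F1 = 75.00`.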

