Mamba-CNN: A Hybrid Architecture for Efficient and Accurate Facial Beauty Prediction
This work addresses accurate and efficient facial beauty prediction for computer vision applications and presents an incremental improvement over existing methods.
The paper tackles the trade-off between efficiency and global context modeling in facial attractiveness prediction by proposing Mamba-CNN, a hybrid architecture that integrates a Mamba-inspired SSM gating mechanism into a CNN backbone. It achieves a Pearson correlation of 0.9187 and an MAE of 0.2022 on the SCUT-FBP5500 benchmark.
The computational assessment of facial attractiveness, a challenging and subjective regression task, is dominated by architectures that face a critical trade-off: Convolutional Neural Networks (CNNs) are efficient but have limited receptive fields, while Vision Transformers (ViTs) model global context at a computational cost that is quadratic in the number of tokens. To address this, we propose Mamba-CNN, a novel and efficient hybrid architecture. Mamba-CNN integrates a lightweight, Mamba-inspired State Space Model (SSM) gating mechanism into a hierarchical convolutional backbone. This core innovation allows the network to dynamically modulate feature maps and selectively emphasize salient facial features and their long-range spatial relationships, mirroring human holistic perception while maintaining computational efficiency. We conduct extensive experiments on the widely used SCUT-FBP5500 benchmark, where our model sets a new state of the art. Mamba-CNN achieves a Pearson Correlation (PC) of 0.9187, a Mean Absolute Error (MAE) of 0.2022, and a Root Mean Square Error (RMSE) of 0.2610. Our findings validate the synergistic potential of combining CNNs with selective SSMs and present a powerful new architectural paradigm for nuanced visual understanding tasks.
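For readers less familiar with the three reported metrics, the following is a minimal sketch of how PC, MAE, and RMSE are conventionally computed from ground-truth and predicted scores. It uses the standard textbook definitions and is not the authors' evaluation code. The function names and the toy score lists are hypothetical and do not come from SCUT-FBP5500.

```python
import math


def pearson_correlation(y_true, y_pred):
    """Pearson correlation between ground-truth and predicted scores."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("Pearson correlation is undefined for constant inputs")
    return cov / math.sqrt(var_t * var_p)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between prediction and ground truth."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


if __name__ == "__main__":
    # Hypothetical toy scores for illustration only (not benchmark data).
    y_true = [2.0, 3.0, 4.0, 3.5]
    y_pred = [2.2, 2.9, 3.7, 3.6]
    print("PC:  ", pearson_correlation(y_true, y_pred))
    print("MAE: ", mean_absolute_error(y_true, y_pred))
    print("RMSE:", root_mean_square_error(y_true, y_pred))
```

Higher PC is better, while lower MAE and RMSE are better. RMSE penalizes large individual errors more heavily than MAE, so it is never smaller than MAE on the same predictions.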