CVOct 17, 2025

VM-BeautyNet: A Synergistic Ensemble of Vision Transformer and Mamba for Facial Beauty Prediction

arXiv:2510.16220v13.61 citationsh-index: 6

Originality Incremental advance

AI Analysis

This work addresses the problem of accurately modeling subjective human aesthetic perception in facial images for computer vision applications, presenting an incremental improvement through a novel ensemble architecture.

The paper tackled facial beauty prediction by introducing VM-BeautyNet, a heterogeneous ensemble combining Vision Transformer and Mamba models, achieving state-of-the-art results with a Pearson Correlation of 0.9212, MAE of 0.2085, and RMSE of 0.2698 on the SCUT-FBP5500 dataset.

Facial Beauty Prediction (FBP) is a complex and challenging computer vision task, aiming to model the subjective and intricate nature of human aesthetic perception. While deep learning models, particularly Convolutional Neural Networks (CNNs), have made significant strides, they often struggle to capture the global, holistic facial features that are critical to human judgment. Vision Transformers (ViT) address this by effectively modeling long-range spatial relationships, but their quadratic complexity can be a bottleneck. This paper introduces a novel, heterogeneous ensemble architecture, \textbf{VM-BeautyNet}, that synergistically fuses the complementary strengths of a Vision Transformer and a Mamba-based Vision model, a recent advancement in State-Space Models (SSMs). The ViT backbone excels at capturing global facial structure and symmetry, while the Mamba backbone efficiently models long-range dependencies with linear complexity, focusing on sequential features and textures. We evaluate our approach on the benchmark SCUT-FBP5500 dataset. Our proposed VM-BeautyNet achieves state-of-the-art performance, with a \textbf{Pearson Correlation (PC) of 0.9212}, a \textbf{Mean Absolute Error (MAE) of 0.2085}, and a \textbf{Root Mean Square Error (RMSE) of 0.2698}. Furthermore, through Grad-CAM visualizations, we provide interpretability analysis that confirms the complementary feature extraction of the two backbones, offering new insights into the model's decision-making process and presenting a powerful new architectural paradigm for computational aesthetics.

View on arXiv PDF

Similar