CV · LG · Oct 31, 2025

Integrating ConvNeXt and Vision Transformers for Enhancing Facial Age Estimation

Gaby Maroun, Salah Eddine Bekhouche, Fadi Dornaika

arXiv:2511.00123v1 · 1 citation · h-index: 36 · Computer Vision and Image Understanding

Originality: Incremental advance

AI Analysis

This work addresses age estimation from facial images, an important computer vision task, but is incremental as it combines existing models rather than introducing fundamentally new approaches.

The authors address facial age estimation by integrating ConvNeXt and Vision Transformers into a hybrid architecture, reporting lower mean absolute error (MAE) on the MORPH II, CACD, and AFAD benchmark datasets.

Age estimation from facial images is a complex and multifaceted challenge in computer vision. In this study, we present a novel hybrid architecture that combines ConvNeXt, a state-of-the-art advancement of convolutional neural networks (CNNs), with Vision Transformers (ViT). While each model independently delivers excellent performance on a variety of tasks, their integration leverages the complementary strengths of the CNN's localized feature extraction capabilities and the Transformer's global attention mechanisms. Our proposed ConvNeXt-ViT hybrid solution was thoroughly evaluated on benchmark age estimation datasets, including MORPH II, CACD, and AFAD, and achieved superior performance in terms of mean absolute error (MAE). To address computational constraints, we leverage pre-trained models and systematically explore different configurations, using linear layers and advanced regularization techniques to optimize the architecture. Comprehensive ablation studies highlight the critical role of individual components and training strategies, and in particular emphasize the importance of adapted attention mechanisms within the CNN framework to improve the model's focus on age-relevant facial features. The results show that the ConvNeXt-ViT hybrid not only outperforms traditional methods, but also provides a robust foundation for future advances in age estimation and related visual tasks. This work underscores the transformative potential of hybrid architectures and represents a promising direction for the seamless integration of CNNs and transformers to address complex computer vision challenges.
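
The abstract reports results as mean absolute error (MAE), the average absolute difference in years between the true and predicted ages. As a minimal sketch of how that metric is computed, the function name and the sample ages below are hypothetical and chosen only for illustration; they are not taken from the paper or its datasets.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Return the average absolute difference between labels and predictions."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must have the same length")
    if not true_ages:
        raise ValueError("inputs must be non-empty")
    # Sum the per-sample absolute errors, then divide by the sample count.
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Hypothetical ground-truth ages and model predictions (illustration only).
true_ages = [23, 35, 41, 58]
predicted_ages = [25.0, 33.5, 44.0, 57.5]

mae = mean_absolute_error(true_ages, predicted_ages)
print(mae)  # (2 + 1.5 + 3 + 0.5) / 4 = 1.75
```

A lower MAE means predictions are closer to the true ages on average, which is the sense in which the abstract describes the hybrid model's performance as superior.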
