Foundation versus Domain-specific Models: Performance Comparison, Fusion, and Explainability in Face Recognition
This work evaluates foundation models for face recognition and examines how they can be integrated with domain-specific models. It offers insights for researchers and practitioners in computer vision, although the contribution is incremental because it compares existing methods.
The paper compares generic foundation models such as CLIP and GPT-4o against domain-specific face recognition models such as AdaFace. Domain-specific models outperformed the foundation models on all benchmarks, but score-level fusion of the two improved accuracy at low false match rates, and foundation models added explainability to the pipeline.
In this paper, we address the following question: How do generic foundation models (e.g., CLIP, BLIP, GPT-4o, Grok-4) compare against a domain-specific face recognition model (viz., AdaFace or ArcFace) on the face recognition task? Through a series of experiments involving several foundation models and benchmark datasets, we report the following findings: (a) In all face benchmark datasets considered, domain-specific models outperformed zero-shot foundation models. (b) The performance of zero-shot generic foundation models improved on over-segmented face images compared to tightly cropped faces, thereby suggesting the importance of contextual clues. (c) A simple score-level fusion of a foundation model with a domain-specific face recognition model improved the accuracy at low false match rates. (d) Foundation models, such as GPT-4o and Grok-4, are able to provide explainability to the face recognition pipeline. In some instances, foundation models are even able to resolve low-confidence decisions made by AdaFace, thereby reiterating the importance of combining domain-specific face recognition models with generic foundation models in a judicious manner.
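Finding (c) can be illustrated with a minimal sketch of score-level fusion. The abstract does not state the fusion rule, so everything here is an assumption for illustration: both matchers are assumed to emit cosine similarities in [-1, 1], the scores are min-max normalized to [0, 1], and they are combined with a weighted sum. The function names (`min_max_normalize`, `fuse_scores`, `tmr_at_fmr`) and the default weight of 0.7 are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float], lo: float, hi: float) -> List[float]:
    """Map scores from the assumed range [lo, hi] to [0, 1], clipping outliers."""
    return [min(1.0, max(0.0, (s - lo) / (hi - lo))) for s in scores]


def fuse_scores(face_scores: Sequence[float],
                foundation_scores: Sequence[float],
                weight: float = 0.7) -> List[float]:
    """Weighted-sum fusion of two matchers' scores for the same image pairs.

    Assumption: both inputs are cosine similarities in [-1, 1]. `weight` is the
    share given to the domain-specific model (0.7 is an arbitrary example value).
    """
    if len(face_scores) != len(foundation_scores):
        raise ValueError("both matchers must score the same image pairs")
    face = min_max_normalize(face_scores, -1.0, 1.0)
    found = min_max_normalize(foundation_scores, -1.0, 1.0)
    return [weight * a + (1.0 - weight) * b for a, b in zip(face, found)]


def tmr_at_fmr(genuine: Sequence[float],
               impostor: Sequence[float],
               fmr: float) -> float:
    """True match rate at a threshold whose false match rate is at most `fmr`.

    A pair is accepted when its score is strictly greater than the threshold.
    The threshold is the (k+1)-th highest impostor score, k = floor(fmr * n),
    so at most k of the n impostor pairs are accepted. Requires 0 <= fmr < 1.
    """
    ranked = sorted(impostor, reverse=True)
    k = math.floor(fmr * len(ranked))
    threshold = ranked[k]
    accepted = sum(1 for g in genuine if g > threshold)
    return accepted / len(genuine)
```

In use, one would compute `tmr_at_fmr` on the domain-specific scores alone and again on the output of `fuse_scores`, at the same low `fmr`, and compare the two rates. Whether fusion helps, and which weight works best, depends on the data; the sketch only shows the mechanics.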