CV, Nov 6, 2023

Fast and Interpretable Face Identification for Out-Of-Distribution Data Using Vision Transformers

arXiv:2311.02803v1 | 8 citations | h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses face identification challenges in real-world scenarios with occlusions, offering a faster and interpretable solution, though it is incremental as it builds on existing methods like DeepFace-EMD.

The paper tackles the problem of face identification for out-of-distribution data, such as occluded faces, by proposing a 2-image Vision Transformer that uses cross-attention for patch-level comparison, achieving comparable accuracy to the state-of-the-art DeepFace-EMD while being more than twice as fast in inference speed.

Most face identification approaches employ a Siamese neural network to compare two images at the image embedding level. Yet, this technique can be vulnerable to occlusion (e.g. faces with masks or sunglasses) and out-of-distribution data. DeepFace-EMD (Phan et al. 2022) reaches state-of-the-art accuracy on out-of-distribution data by first comparing two images at the image level, and then at the patch level. Yet, its later patch-wise re-ranking stage has a large $O(n^3 \log n)$ time complexity (for $n$ patches in an image) due to the optimal transport optimization. In this paper, we propose a novel 2-image Vision Transformer (ViT) that compares two images at the patch level using cross-attention. After training on 2M pairs of images from CASIA Webface (Yi et al. 2014), our model achieves accuracy comparable to that of DeepFace-EMD on out-of-distribution data, yet at an inference speed more than twice as fast as DeepFace-EMD (Phan et al. 2022). In addition, via a human study, our model shows promising explainability through the visualization of cross-attention. We believe our work can inspire more exploration of ViTs for face identification.
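To make the idea of patch-level comparison via cross-attention more concrete, here is a minimal sketch of a single cross-attention step between the patch embeddings of two images. It is not the paper's architecture: it assumes one attention head, no learned query/key/value projections, and random vectors standing in for patch embeddings. The names `cross_attention`, `image_a`, and `image_b` are hypothetical and chosen only for illustration. Note that the score matrix is $n \times n$ for $n$ patches per image, so this step scales quadratically in $n$.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(patches_a, patches_b):
    """Single-head cross-attention sketch.

    Queries come from image A's patches; keys and values come from
    image B's patches. Learned projections are omitted (an assumption
    made for brevity, not taken from the paper).
    """
    d = len(patches_a[0])
    scale = 1.0 / math.sqrt(d)

    # weights[i][j]: how strongly patch i of A attends to patch j of B.
    weights = []
    for q in patches_a:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in patches_b]
        weights.append(softmax(scores))

    # Each output row is a weighted average of B's patch embeddings.
    attended = [
        [sum(w * v[j] for w, v in zip(row, patches_b)) for j in range(d)]
        for row in weights
    ]
    return weights, attended


# Toy stand-ins for patch embeddings: n patches of dimension d per image.
random.seed(0)
n, d = 4, 8
image_a = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
image_b = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

weights, attended = cross_attention(image_a, image_b)

# For each patch in A, report the patch in B it attends to most.
for i, row in enumerate(weights):
    best = max(range(len(row)), key=row.__getitem__)
    print(f"patch {i} of A attends most to patch {best} of B")
```

The `weights` matrix is the kind of quantity one could visualize as a patch-to-patch heatmap, which is the general intuition behind using cross-attention for explanations; how the paper itself renders its visualizations is not specified in this summary.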

Code Implementations: 1 repo
