PaW-ViT: A Patch-based Warping Vision Transformer for Robust Ear Verification
This work addresses the disconnect between the morphological variation of ear biometrics and the positional sensitivity of transformers, offering a potential improvement for authentication schemes. The contribution is incremental, however: it builds on existing ViT methods and adds a domain-specific preprocessing step.
The paper tackles the performance degradation that vision transformers suffer in ear verification when rectangular tokens incorporate irrelevant background information. It introduces PaW-ViT, a preprocessing method that aligns token boundaries to ear features, which improves robustness to shape, size, and pose variations across several ViT models.
Vision transformers (ViTs) for visual recognition operate on rectangular tokens, and these tokens can strongly degrade performance when they incorporate information from outside the object to be recognized. This paper introduces PaW-ViT (Patch-based Warping Vision Transformer), a preprocessing approach rooted in anatomical knowledge that normalizes ear images to enhance the efficacy of ViTs. By aligning token boundaries to detected ear feature boundaries, PaW-ViT obtains greater robustness to shape, size, and pose variation. By aligning feature boundaries to the natural curvature of the ear, it produces more consistent token representations across ear morphologies. Experiments confirm the effectiveness of PaW-ViT on several ViT models (ViT-T, ViT-S, ViT-B, ViT-L) and show that the alignment is reasonably robust to variation in shape, size, and pose. Our work aims to bridge the disconnect between the morphological variation of ear biometrics and the positional sensitivity of transformer architectures, presenting a possible avenue for authentication schemes.
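To make the motivating problem concrete, the following minimal sketch counts how many rectangular tokens of a standard ViT grid mix object and background pixels. It does not implement PaW-ViT's warping. The elliptical "ear" mask, the 224-pixel image size, the 16-pixel patch size, and all function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) array into non-overlapping (patch, patch) tiles."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tiles = image.reshape(h // patch, patch, w // patch, patch)
    # Reorder axes to (grid_row, grid_col, patch_row, patch_col).
    return tiles.transpose(0, 2, 1, 3)

def elliptical_mask(size, center, axes):
    """Hypothetical stand-in for an ear segmentation: a filled ellipse."""
    yy, xx = np.mgrid[0:size, 0:size]
    cy, cx = center
    ay, ax = axes
    return ((yy - cy) / ay) ** 2 + ((xx - cx) / ax) ** 2 <= 1.0

def foreground_fraction(mask, patch):
    """Fraction of foreground pixels inside each rectangular token."""
    return patchify(mask.astype(float), patch).mean(axis=(2, 3))

# Assumed setup: 224x224 input, 16x16 patches, giving a 14x14 token grid.
size, patch = 224, 16
mask = elliptical_mask(size, center=(112, 112), axes=(100, 60))
frac = foreground_fraction(mask, patch)

mixed = (frac > 0.0) & (frac < 1.0)   # tokens straddling the object boundary
background_only = frac == 0.0         # tokens with no object pixels at all

print("token grid:", frac.shape)
print("mixed tokens:", int(mixed.sum()))
print("background-only tokens:", int(background_only.sum()))
```

Under these assumptions, every token along the ellipse boundary carries a blend of object and background content, which is the kind of contamination that aligning token boundaries to ear feature boundaries is intended to reduce.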