Minutiae-Guided Fingerprint Embeddings via Vision Transformers
This work addresses efficient and accurate fingerprint recognition for security and identification applications. Its contribution is an incremental improvement obtained by fusing embeddings learned by CNNs and ViTs.
The paper proposes the first use of a Vision Transformer (ViT) to learn fixed-length fingerprint embeddings. Fusing CNN and ViT embeddings yields TAR=94.23% at FAR=0.1% on the NIST SD 302 dataset, close to a commercial SOTA matcher at TAR=96.71%, while matching far faster: 2.5 million matches per second versus 50K matches per second for the commercial system.
Minutiae matching has long dominated the field of fingerprint recognition. However, deep networks can be used to extract fixed-length embeddings from fingerprints. To date, the few studies that have explored the use of CNN architectures to extract such embeddings have shown considerable promise. Inspired by these early works, we propose the first use of a Vision Transformer (ViT) to learn a discriminative fixed-length fingerprint embedding. We further demonstrate that guiding the ViT to focus on local, minutiae-related features boosts recognition performance. Finally, we show that by fusing embeddings learned by CNNs and ViTs, we can reach near parity with a commercial state-of-the-art (SOTA) matcher. In particular, we obtain TAR=94.23% @ FAR=0.1% on the NIST SD 302 public-domain dataset, compared to a SOTA commercial matcher, which obtains TAR=96.71% @ FAR=0.1%. Additionally, our fixed-length embeddings can be matched orders of magnitude faster than the commercial system (2.5 million matches/second compared to 50K matches/second). We make our code and models publicly available to encourage further research on this topic: https://github.com/tba.
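To make the two quantities in the abstract concrete, the following minimal sketch shows one common way to score fixed-length embeddings and to read off TAR at a fixed FAR. It is an illustration under stated assumptions, not the released implementation: cosine similarity is assumed as the match score, and the function names `l2_normalize`, `match_scores`, and `tar_at_far` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def match_scores(probes, gallery):
    """Cosine similarity between every probe and every gallery embedding.

    Assumption: cosine similarity is the match score; the abstract does not
    state which similarity measure is used.
    """
    return l2_normalize(probes) @ l2_normalize(gallery).T


def tar_at_far(genuine, impostor, far=0.001):
    """Return (TAR, threshold) at the operating point where the fraction of
    accepted impostor scores is at most `far` (requires 0 <= far < 1)."""
    genuine = np.asarray(genuine, dtype=np.float64)
    impostor_desc = np.sort(np.asarray(impostor, dtype=np.float64))[::-1]
    # Number of impostor comparisons that may be (wrongly) accepted.
    allowed = int(np.floor(far * impostor_desc.size))
    # Accept only scores strictly above this value, so at most `allowed`
    # impostor scores pass even when there are ties.
    threshold = impostor_desc[allowed]
    tar = float(np.mean(genuine > threshold))
    return tar, float(threshold)
```

Because every embedding has the same length, all probe-gallery comparisons reduce to a single matrix product, which is the property that makes fixed-length matching fast compared with pairwise minutiae matching. A call such as `tar_at_far(genuine_scores, impostor_scores, far=0.001)` corresponds to the FAR=0.1% operating point quoted above, where `genuine_scores` and `impostor_scores` are placeholder arrays of same-finger and different-finger similarities.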