Local Feature Matching with Transformers for low-end devices
This work addresses the challenge of deploying efficient computer vision models on resource-constrained devices, representing an incremental optimization of an existing method.
The paper tackles the problem of adapting the LoFTR local feature matching method for low-end devices by reducing parameters and using knowledge distillation, achieving comparable accuracy to the original model in coarse matching despite significant size reduction.
LoFTR (arXiv:2104.00680) is an efficient deep learning method for finding appropriate local feature matches on image pairs. This paper reports on optimizing this method to run on devices with low computational performance and limited memory. The original LoFTR approach is based on a ResNet (arXiv:1512.03385) backbone and two modules based on the Linear Transformer (arXiv:2006.04768) architecture. In the presented work, only the coarse-matching block was kept, the number of parameters was significantly reduced, and the network was trained using a knowledge distillation technique. The comparison showed that, in the coarse-matching block, the student model reaches feature matching accuracy comparable to that of the teacher model despite the significant reduction in model size. The paper also describes the additional steps required to make the model compatible with the NVIDIA TensorRT runtime, and presents an approach to adapting the training method for low-end GPUs.
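The knowledge distillation step mentioned above can be illustrated with a minimal sketch. The abstract does not state which distillation loss was used, so the temperature-softened KL divergence below (the common Hinton-style formulation) is an assumption, as are the function names `log_softmax` and `distillation_loss` and the choice to treat one row of coarse matching scores as a list of logits.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of scores."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: this is a generic distillation loss, not necessarily
    the one used in the paper. The T^2 factor keeps gradient magnitudes
    comparable across temperatures.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


# Hypothetical scores of one coarse feature against three candidates.
teacher_row = [4.0, 1.0, 0.5]
student_row = [3.0, 1.5, 0.5]
loss = distillation_loss(student_row, teacher_row)
```

The loss is zero when the student reproduces the teacher's softened distribution exactly and grows as the two diverge, which is what lets a much smaller student be trained against the teacher's coarse matching output rather than only against ground-truth matches.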