Explicit Spatial Encoding for Deep Local Descriptors
This work addresses the need for more robust and efficient local descriptors in computer vision, particularly for tasks like image matching. The contribution is incremental, since it builds on existing kernel and neural network methods.
The paper improves deep local patch descriptors by combining efficient match kernels with explicit spatial encoding, and reports consistently outperforming all other methods on standard benchmarks for both 32x32 and 64x64 patch sizes.
We propose a kernelized deep local-patch descriptor based on efficient match kernels of neural network activations. The response of each receptive field is encoded together with its spatial location using explicit feature maps. Two location parametrizations, Cartesian and polar, are used to provide robustness to different types of canonical patch misalignment. Additionally, we analyze how the conventional architecture, i.e., a fully connected layer attached after the convolutional part, encodes responses in a spatially variant way. In contrast, our descriptor uses explicit spatial encoding, and its potential applications are not limited to local patches. We evaluate the descriptor on standard benchmarks. Both versions, encoding 32x32 or 64x64 patches, consistently outperform all other methods on all benchmarks. The number of parameters of the model is independent of the input patch resolution.
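The following is a minimal sketch of the general idea of encoding each receptive-field response jointly with its location through an explicit feature map, not the implementation from the paper. Everything in it is an assumption made for illustration: the truncated Fourier feature map, the frequency weights, the `period` value, the coordinate range [-1, 1], and all function names. Each cell contributes the Kronecker product of its activation vector with the feature maps of its x and y coordinates, and the contributions are summed, so the inner product of two descriptors equals a match kernel over all pairs of cells.

```python
import math

def _weights(n_freq):
    # Hypothetical decaying frequency weights, normalised to sum to 1
    # so that the kernel value at zero offset is exactly 1.
    raw = [1.0 / (1 + k) ** 2 for k in range(n_freq + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def shift_kernel(d, n_freq=2, period=4.0):
    """Closed form of the assumed kernel: k(d) = sum_k w_k cos(k * omega * d)."""
    omega = 2.0 * math.pi / period
    return sum(w * math.cos(k * omega * d)
               for k, w in enumerate(_weights(n_freq)))

def feature_map(t, n_freq=2, period=4.0):
    """Explicit map phi with dot(phi(s), phi(t)) == shift_kernel(s - t)."""
    omega = 2.0 * math.pi / period
    w = _weights(n_freq)
    feats = [math.sqrt(w[0])]
    for k in range(1, n_freq + 1):
        a = math.sqrt(w[k])
        feats.append(a * math.cos(k * omega * t))
        feats.append(a * math.sin(k * omega * t))
    return feats

def kron(a, b):
    """Kronecker product of two vectors given as lists."""
    return [x * y for x in a for y in b]

def encode_patch(cells, normalize=True):
    """Aggregate (x, y, activation) cells into one descriptor.

    cells: non-empty list of (x, y, act), with x and y assumed to lie in
    [-1, 1] and act a list of floats (one receptive-field response).
    """
    desc = None
    for x, y, act in cells:
        term = kron(act, kron(feature_map(x), feature_map(y)))
        desc = term if desc is None else [p + q for p, q in zip(desc, term)]
    if normalize:
        norm = math.sqrt(sum(v * v for v in desc))
        if norm > 0:
            desc = [v / norm for v in desc]
    return desc

def match_kernel(cells_a, cells_b):
    """Reference value: sum over all cell pairs of <a, b> * k(dx) * k(dy)."""
    total = 0.0
    for xa, ya, a in cells_a:
        for xb, yb, b in cells_b:
            sim = sum(p * q for p, q in zip(a, b))
            total += sim * shift_kernel(xa - xb) * shift_kernel(ya - yb)
    return total
```

In this sketch the descriptor length depends only on the activation dimension and the number of frequencies, not on how many cells the patch contains. A polar variant could be obtained under the same assumptions by passing (radius, angle) instead of (x, y) and using a period of 2π for the angle.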