TS-Net: Combining modality specific and common features for multimodal patch matching
This work addresses the problem of finding correspondences between image patches from different modalities, such as RGB vs. sketch, for applications in computer vision; it is incremental in that it builds on existing Siamese-like approaches.
The paper tackles multimodal patch matching by proposing TS-Net, a three-stream architecture that combines modality-specific and common features, achieving significant performance gains over Siamese and Pseudo-Siamese networks on three datasets.
Multimodal patch matching addresses the problem of finding correspondences between image patches from two different modalities, e.g. RGB vs. sketch or RGB vs. near-infrared. Patches from different modalities can be compared either by discovering the information common to both modalities (Siamese-like approaches) or by exploiting the modality-specific information (Pseudo-Siamese-like approaches). We observed that neither of these two approaches is optimal. This motivates us to propose a three-stream architecture, dubbed TS-Net, that combines the benefits of the two. In addition, we show that adding extra constraints in the intermediate layers of such networks further boosts performance. Experiments on three multimodal datasets show significant performance gains in comparison with Siamese and Pseudo-Siamese networks.
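To make the distinction between the two families concrete, the following is a minimal sketch of one possible reading of a three-stream design: one stream with weights shared by both modalities (Siamese-like) and two streams with separate weights per modality (Pseudo-Siamese-like). It is not the authors' implementation. The class name `ThreeStreamSketch`, the single linear layer with ReLU used as a feature extractor, the random untrained weights, and the fixed fusion rule are all assumptions made for illustration; the abstract does not specify how the streams are combined or trained.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a learned feature extractor."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def embed(weights, patch):
    """Apply a linear map followed by ReLU to a flattened patch."""
    return [max(0.0, sum(w * x for w, x in zip(row, patch))) for row in weights]


def l2_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ThreeStreamSketch:
    """Hypothetical three-stream matcher (illustrative, untrained)."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        # Stream 1: Siamese-like, the same weights see both modalities.
        self.shared = make_linear(in_dim, feat_dim, rng)
        # Streams 2 and 3: Pseudo-Siamese-like, one set of weights per modality.
        self.specific_a = make_linear(in_dim, feat_dim, rng)
        self.specific_b = make_linear(in_dim, feat_dim, rng)
        # Assumed fusion parameters; a real model would learn these.
        self.w_common, self.w_specific, self.bias = -1.0, -1.0, 1.0

    def forward(self, patch_a, patch_b):
        # Distance between common (shared-weight) features.
        d_common = l2_distance(embed(self.shared, patch_a),
                               embed(self.shared, patch_b))
        # Distance between modality-specific features.
        d_specific = l2_distance(embed(self.specific_a, patch_a),
                                 embed(self.specific_b, patch_b))
        # Assumed fusion: logistic function of a weighted sum of both distances.
        logit = self.w_common * d_common + self.w_specific * d_specific + self.bias
        score = 1.0 / (1.0 + math.exp(-logit))
        return d_common, d_specific, score


# Usage: feed the same flattened 4x4 patch to both inputs.
net = ThreeStreamSketch(in_dim=16, feat_dim=8)
patch_rng = random.Random(1)
patch = [patch_rng.random() for _ in range(16)]
d_common, d_specific, score = net.forward(patch, patch)
print(d_common, d_specific, score)
```

In this sketch, identical inputs always yield a common-feature distance of zero because both pass through the same weights, whereas the modality-specific distance is generally non-zero because the two branches use different weights. This illustrates why the two kinds of streams carry different information about a patch pair.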