Path-Invariant Map Networks
This paper addresses the under-explored problem of self-supervision in directed map networks, which arise in computer vision and related fields. It offers a novel constraint that improves efficiency and reduces labeled-data requirements, although the contribution is incremental in that it extends consistency constraints from undirected to directed networks.
The paper tackles the optimization of directed map networks by introducing a path-invariance constraint, which enforces consistency across different paths between the same pair of source and target domains. It demonstrates the constraint's effectiveness on tasks such as 3D semantic segmentation, where the approach needs only 8% labeled data from ScanNet to match the performance of a single 3D segmentation network trained with 30% to 100% labeled data.
Optimizing a network of maps among a collection of objects/domains (or map synchronization) is a central problem across computer vision and many other relevant fields. Compared to optimizing pairwise maps in isolation, the benefit of map synchronization is that there are natural constraints among a map network that can improve the quality of individual maps. While such self-supervision constraints are well-understood for undirected map networks (e.g., the cycle-consistency constraint), they are under-explored for directed map networks, which naturally arise when maps are given by parametric maps (e.g., a feed-forward neural network). In this paper, we study a natural self-supervision constraint for directed map networks called path-invariance, which enforces that composite maps along different paths between a fixed pair of source and target domains are identical. We introduce path-invariance bases for efficient encoding of the path-invariance constraint and present an algorithm that outputs a path-invariance basis with polynomial time and space complexities. We demonstrate the effectiveness of our approach on optimizing object correspondences, estimating dense image maps via neural networks, and semantic segmentation of 3D scenes via map networks of diverse 3D representations. In particular, for 3D semantic segmentation, our approach only requires 8% labeled data from ScanNet to achieve the same performance as training a single 3D segmentation network with 30% to 100% labeled data.
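To make the path-invariance constraint concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a toy setting in which every domain holds the same number of points and every directed map is a permutation stored as a tuple. The names `compose_along_path` and `path_invariance_gap`, and the domains `A`, `B`, `C`, are hypothetical and chosen only for illustration. The sketch composes the maps along two paths that share a source and a target, then counts the points on which the two composite maps disagree.

```python
from typing import Dict, Sequence, Tuple

Edge = Tuple[str, str]   # (source domain, target domain); names are illustrative
Perm = Tuple[int, ...]   # perm[i] = point in the target matched to point i in the source


def compose_along_path(maps: Dict[Edge, Perm], path: Sequence[str]) -> Perm:
    """Compose the maps on consecutive directed edges of `path`."""
    if len(path) < 2:
        raise ValueError("a path needs at least two domains")
    size = len(maps[(path[0], path[1])])
    result = tuple(range(size))  # start from the identity map on the source
    for src, dst in zip(path, path[1:]):
        edge_map = maps[(src, dst)]
        # Push every source point one edge further along the path.
        result = tuple(edge_map[i] for i in result)
    return result


def path_invariance_gap(maps: Dict[Edge, Perm],
                        path_a: Sequence[str],
                        path_b: Sequence[str]) -> int:
    """Count the points on which the two composite maps disagree."""
    if path_a[0] != path_b[0] or path_a[-1] != path_b[-1]:
        raise ValueError("paths must share their source and target domains")
    composite_a = compose_along_path(maps, path_a)
    composite_b = compose_along_path(maps, path_b)
    return sum(1 for x, y in zip(composite_a, composite_b) if x != y)


# Assumed toy network with three domains and three points per domain.
consistent_maps: Dict[Edge, Perm] = {
    ("A", "B"): (1, 2, 0),
    ("B", "C"): (2, 0, 1),
    ("A", "C"): (0, 1, 2),  # equals the composite A -> B -> C
}
noisy_maps = dict(consistent_maps)
noisy_maps[("A", "C")] = (0, 2, 1)  # corrupt the direct map

gap_consistent = path_invariance_gap(consistent_maps, ["A", "B", "C"], ["A", "C"])
gap_noisy = path_invariance_gap(noisy_maps, ["A", "B", "C"], ["A", "C"])
```

In this toy network the consistent maps give a gap of zero, while corrupting the direct map from `A` to `C` produces a nonzero gap. Checking every pair of paths in this way grows quickly with network size, which is the cost that a compact path-invariance basis is meant to avoid.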