Mix and match networks: encoder-decoder alignment for zero-pair image translation
This paper addresses image translation across domains without paired data, a problem relevant to researchers and practitioners in computer vision. The contribution appears incremental, since it builds on existing encoder-decoder and alignment techniques.
The paper tackles zero-pair image translation, where no direct paired data exists between domains or modalities. It proposes mix and match networks, which align encoders and decoders so that encoder-decoder pairs never trained together can be composed at test time. The approach is demonstrated on colorization and style transfer between unseen combinations of domains, and on estimating semantic segmentation from depth images without any depth-segmentation training pairs. The model is reported to outperform baselines based on pix2pix and CycleGAN.
We address the problem of image translation between domains or modalities for which no direct paired data is available (i.e., zero-pair translation). We propose mix and match networks, based on multiple encoders and decoders aligned in such a way that other encoder-decoder pairs can be composed at test time to perform unseen image translation tasks between domains or modalities for which explicit paired samples were not seen during training. We study the impact of autoencoders, side information and losses in improving the alignment and transferability of trained pairwise translation models to unseen translations. We show our approach is scalable and can perform colorization and style transfer between unseen combinations of domains. We evaluate our system in a challenging cross-modal setting where semantic segmentation is estimated from depth images, without explicit access to any depth-semantic segmentation training pairs. Our model outperforms baselines based on pix2pix and CycleGAN models.
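The composition idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the modality names, the `TRAINED_PAIRS` set, and the helpers `translate` and `zero_pair_translations` are hypothetical, and the toy invertible maps below merely stand in for trained networks whose latent spaces are assumed to be already aligned.

```python
from itertools import permutations

# Toy stand-ins for trained networks (an assumption for illustration):
# each modality is an invertible function of a shared latent code,
# represented here as a list of floats.
ENCODERS = {
    "rgb": lambda x: [v / 2.0 for v in x],
    "depth": lambda x: [v - 1.0 for v in x],
    "segmentation": lambda x: [float(v) for v in x],
}
DECODERS = {
    "rgb": lambda z: [v * 2.0 for v in z],
    "depth": lambda z: [v + 1.0 for v in z],
    "segmentation": lambda z: [round(v) for v in z],
}

# Hypothetical set of translations for which paired data was available.
# No depth <-> segmentation pair appears here.
TRAINED_PAIRS = {
    ("rgb", "depth"),
    ("depth", "rgb"),
    ("rgb", "segmentation"),
    ("segmentation", "rgb"),
}


def translate(x, src, dst):
    """Compose the source encoder with the target decoder."""
    return DECODERS[dst](ENCODERS[src](x))


def zero_pair_translations():
    """List ordered modality pairs that were never trained together."""
    return sorted(
        pair for pair in permutations(ENCODERS, 2) if pair not in TRAINED_PAIRS
    )


if __name__ == "__main__":
    depth_image = [1.0, 3.0, 2.0]
    print(zero_pair_translations())
    print(translate(depth_image, "depth", "segmentation"))
```

In this sketch the depth-to-segmentation translation works only because the toy encoders and decoders share one latent space by construction. In the setting the abstract describes, that alignment is what training has to achieve, with autoencoders, side information, and losses studied as means of improving it.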