LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization
This work addresses label efficiency in image harmonization, offering an incremental improvement over existing methods: gains of 0.4 dB to 1.0 dB over the prior state of the art on iHarmony4.
The paper proposes LEMaRT, a self-supervised pre-training method for image harmonization that generates perturbed training images without extensive annotation, and introduces SwinIH, a harmonization model. Pre-trained with LEMaRT, SwinIH achieves state-of-the-art results on the iHarmony4 dataset, outperforming the previous best method by 0.4 dB when fine-tuned on 50% of the training data and by 1.0 dB with the full training data.
We present a simple yet effective self-supervised pre-training method for image harmonization that can leverage large-scale unannotated image datasets. To achieve this goal, we first generate pre-training data online with our Label-Efficient Masked Region Transform (LEMaRT) pipeline. Given an image, LEMaRT generates a foreground mask and then applies a set of transformations to perturb various visual attributes (e.g., defocus blur, contrast, and saturation) of the region specified by the generated mask. We then pre-train image harmonization models by recovering the original image from the perturbed image. Second, we introduce an image harmonization model, SwinIH, by retrofitting the Swin Transformer [27] with a combination of local and global self-attention mechanisms. Pre-training SwinIH with LEMaRT results in a new state of the art for image harmonization while being label-efficient, i.e., consuming less annotated data for fine-tuning than existing methods. Notably, on the iHarmony4 dataset [8], SwinIH outperforms the previous state of the art, SCS-Co [16], by a margin of 0.4 dB when it is fine-tuned on only 50% of the training data, and by 1.0 dB when it is trained on the full training dataset.
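The data-generation step described above (generate a mask, perturb the masked region, keep the original image as the recovery target) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The rectangular mask, the two transforms shown (contrast and saturation; defocus blur is omitted for brevity), the parameter ranges, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def random_rect_mask(height, width, rng, min_frac=0.2, max_frac=0.6):
    """Hypothetical mask generator: one random axis-aligned rectangle.

    The abstract does not specify how the foreground mask is produced;
    a rectangle is used here only to keep the sketch short.
    """
    h = max(1, int(height * rng.uniform(min_frac, max_frac)))
    w = max(1, int(width * rng.uniform(min_frac, max_frac)))
    top = rng.integers(0, height - h + 1)
    left = rng.integers(0, width - w + 1)
    mask = np.zeros((height, width, 1), dtype=np.float32)
    mask[top:top + h, left:left + w] = 1.0
    return mask


def adjust_contrast(img, factor):
    """Scale each channel's deviation from its mean by `factor`."""
    mean = img.mean(axis=(0, 1), keepdims=True)
    return np.clip(mean + factor * (img - mean), 0.0, 1.0)


def adjust_saturation(img, factor):
    """Scale each pixel's deviation from its channel-average gray value."""
    gray = img.mean(axis=2, keepdims=True)
    return np.clip(gray + factor * (img - gray), 0.0, 1.0)


def make_pretraining_pair(image, rng):
    """Return (perturbed, mask, target) for one image in [0, 1], shape (H, W, 3).

    Only the masked region is perturbed; pixels outside the mask are
    copied from the original, which also serves as the recovery target.
    """
    mask = random_rect_mask(image.shape[0], image.shape[1], rng)
    transformed = adjust_contrast(image, rng.uniform(0.5, 1.5))  # assumed range
    transformed = adjust_saturation(transformed, rng.uniform(0.5, 1.5))  # assumed range
    perturbed = mask * transformed + (1.0 - mask) * image
    return perturbed.astype(np.float32), mask, image
```

In a pre-training loop under these assumptions, a harmonization model would take `perturbed` and `mask` as input and be trained to reconstruct `target`, so no human-annotated harmonization labels are required.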