CV, Aug 5, 2024

A Two-Stage Progressive Pre-training using Multi-Modal Contrastive Masked Autoencoders

arXiv:2408.02245v2
h-index: 16
AI Analysis

This work addresses semantic segmentation in RGB-D data, showing incremental gains over existing methods.

The paper tackles image understanding for RGB-D datasets by proposing a two-stage progressive pre-training method using multi-modal contrastive masked autoencoders and denoising, achieving a +1.3% mIoU improvement over Mask3D on ScanNet semantic segmentation.

In this paper, we propose a new progressive pre-training method for image understanding tasks which leverages RGB-D datasets. The method utilizes Multi-Modal Contrastive Masked Autoencoder and Denoising techniques. Our proposed approach consists of two stages. In the first stage, we pre-train the model using contrastive learning to learn cross-modal representations. In the second stage, we further pre-train the model using masked autoencoding and the denoising (noise prediction) objective used in diffusion models. Masked autoencoding focuses on reconstructing the missing patches in the input modality using local spatial correlations, while denoising learns high-frequency components of the input data. Moreover, our approach incorporates global distillation in the second stage by leveraging the knowledge acquired in stage one. Our approach is scalable, robust and suitable for pre-training on RGB-D datasets. Extensive experiments on multiple datasets such as ScanNet, NYUv2 and SUN RGB-D show the efficacy and superior performance of our approach. Specifically, we show an improvement of +1.3% mIoU over Mask3D on ScanNet semantic segmentation. We further demonstrate the effectiveness of our approach in the low-data regime by evaluating it on the semantic segmentation task against state-of-the-art methods.
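The two stages described above can be illustrated with a small loss-level sketch. This is not the authors' implementation: the function names (`info_nce`, `masked_mse`, `stage2_loss`), the temperature, the loss weights `w_noise` and `w_distill`, and the use of an MSE between normalized global embeddings for distillation are all assumptions made for illustration. Stage one is approximated by an InfoNCE-style contrastive loss between paired RGB and depth embeddings. Stage two sums a reconstruction loss on masked patches, a noise-prediction loss, and a distillation term against the stage-one model.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # L2-normalize a vector; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def info_nce(rgb_embs, depth_embs, temperature=0.07):
    """Stage 1 (assumed form): RGB-to-depth contrastive loss.

    The i-th RGB embedding should match the i-th depth embedding and
    mismatch every other depth embedding in the batch.
    """
    rgb = [normalize(v) for v in rgb_embs]
    depth = [normalize(v) for v in depth_embs]
    total = 0.0
    for i, r in enumerate(rgb):
        logits = [dot(r, d) / temperature for d in depth]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # cross-entropy with target i
    return total / len(rgb)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def masked_mse(pred, target, mask):
    """Reconstruction error over masked patches only (mask[i] == 1: hidden)."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0

def stage2_loss(recon, target, mask, noise_pred, noise,
                student_global, teacher_global,
                w_noise=1.0, w_distill=1.0):
    """Stage 2 (assumed form): masked reconstruction + noise prediction
    + global distillation from the frozen stage-one model (the teacher)."""
    l_mae = masked_mse(recon, target, mask)
    l_noise = mse(noise_pred, noise)
    l_distill = mse(normalize(student_global), normalize(teacher_global))
    return l_mae + w_noise * l_noise + w_distill * l_distill
```

In this toy setting each patch is a single scalar. A real model would apply the same terms to tensors of patch pixels and depth values, and the relative weighting of the three stage-two terms would need tuning.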
