CV Sep 20, 2024

Formula-Supervised Visual-Geometric Pre-training

Ryosuke Yamada, Kensho Hara, Hirokatsu Kataoka, Koshi Makihara, Nakamasa Inoue, Rio Yokota, Yutaka Satoh

arXiv:2409.13535v15.22 citations h-index: 18

Originality: Incremental advance

AI Analysis

This work addresses the challenge of modality separation in computer vision for researchers and practitioners, offering a novel synthetic approach that reduces reliance on real data, though it appears incremental in advancing pre-training techniques.

The paper tackles the problem of integrating images and point clouds for visual-geometric representation learning. It introduces a synthetic pre-training method that generates aligned images and point clouds from mathematical formulas, and it reports more effective pre-training than VisualAtom and PC-FractalDB across six tasks: image and 3D object classification, detection, and segmentation.

Throughout the history of computer vision, while research has explored the integration of images (visual) and point clouds (geometric), many advancements in image and 3D object recognition have tended to process these modalities separately. We aim to bridge this divide by integrating images and point clouds on a unified transformer model. This approach integrates the modality-specific properties of images and point clouds and achieves fundamental downstream tasks in image and 3D object recognition on a unified transformer model by learning visual-geometric representations. In this work, we introduce Formula-Supervised Visual-Geometric Pre-training (FSVGP), a novel synthetic pre-training method that automatically generates aligned synthetic images and point clouds from mathematical formulas. Through cross-modality supervision, we enable supervised pre-training between visual and geometric modalities. FSVGP also reduces reliance on real data collection, cross-modality alignment, and human annotation. Our experimental results show that FSVGP pre-trains more effectively than VisualAtom and PC-FractalDB across six tasks: image and 3D object classification, detection, and segmentation. These achievements demonstrate FSVGP's superior generalization in image and 3D object recognition and underscore the potential of synthetic pre-training in visual-geometric representation learning. Our project website is available at https://ryosuke-yamada.github.io/fdsl-fsvgp/.
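To make the idea of formula-generated, aligned image and point-cloud pairs concrete, here is a minimal Python sketch. It is not the paper's generator. The formula (a sphere with sinusoidal lobes), the orthographic rendering, and the names `sample_point_cloud`, `render_image`, and `make_sample` are all hypothetical, chosen only for illustration. The one point it is meant to show is that a single formula parameter can produce both modalities and also serve as the shared label, so no human annotation or cross-modality alignment step is needed.

```python
import math
import random


def sample_point_cloud(class_id, n_points=512, seed=0):
    """Sample 3D points from a toy formula (hypothetical, not the paper's).

    Assumption: the class id selects the number of lobes on a deformed sphere.
    """
    rng = random.Random(seed)
    freq = class_id + 1
    points = []
    for _ in range(n_points):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        phi = rng.uniform(0.0, math.pi)
        # Radius varies between 0.7 and 1.3 depending on the angles.
        r = 1.0 + 0.3 * math.sin(freq * theta) * math.sin(freq * phi)
        points.append((
            r * math.sin(phi) * math.cos(theta),
            r * math.sin(phi) * math.sin(theta),
            r * math.cos(phi),
        ))
    return points


def render_image(points, size=32, extent=1.5):
    """Project points orthographically onto the xy-plane as a binary image."""
    image = [[0] * size for _ in range(size)]
    for x, y, _z in points:
        col = int((x + extent) / (2 * extent) * (size - 1))
        row = int((y + extent) / (2 * extent) * (size - 1))
        image[row][col] = 1
    return image


def make_sample(class_id, seed=0):
    """Return an aligned (image, point cloud) pair with the formula class as label."""
    points = sample_point_cloud(class_id, seed=seed)
    return {"image": render_image(points), "points": points, "label": class_id}


sample = make_sample(class_id=3)
```

Because the image is rendered from the same points, the two modalities are aligned by construction, and `sample["label"]` can supervise a model on either input.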
