CV | Aug 12, 2021

Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

arXiv:2108.05863v1 | 27 citations
Originality: Incremental advance
AI Analysis

This work addresses multimodal reasoning over images, text, and 3D geometry, a problem relevant to researchers in computer vision and AI. It is an incremental advance: it builds on existing 3D reconstruction methods by adding language data.

The authors address the underuse of language data in 3D vision by introducing WikiScenes, a large-scale dataset that combines images, text, and 3D geometry for landmarks, and by developing a weakly-supervised framework that associates semantic concepts with image pixels and 3D points.

The abundance and richness of Internet photos of landmarks and cities have led to significant progress in 3D vision over the past two decades, including automated 3D reconstructions of the world's landmarks from tourist photos. However, a major source of information available for these 3D-augmented collections (namely language, e.g., from image captions) has been virtually untapped. In this work, we present WikiScenes, a new, large-scale dataset of landmark photo collections that contains descriptive text in the form of captions and hierarchical category names. WikiScenes forms a new testbed for multimodal reasoning involving images, text, and 3D geometry. We demonstrate the utility of WikiScenes for learning semantic concepts over images and 3D models. Our weakly-supervised framework connects images, 3D structure, and semantics, utilizing the strong constraints provided by 3D geometry, to associate semantic concepts with image pixels and 3D points.

Code Implementations: 1 repo
