CV · AI · CL · LG · Mar 24, 2021

VLGrammar: Grounded Grammar Induction of Vision and Language

arXiv:2103.12975v1 · 28 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of learning hierarchical structures in both visual and linguistic domains for AI systems, representing an incremental advance in multimodal representation learning.

The authors address grounded grammar induction for both vision and language with VLGrammar, a method that pairs compound probabilistic context-free grammars with a contrastive learning objective. On the PartIt dataset it outperforms all baselines in both image and language grammar induction, and it improves unsupervised image clustering accuracy by 30%.

Cognitive grammar suggests that the acquisition of language grammar is grounded within visual structures. While grammar is an essential representation of natural language, it also exists ubiquitously in vision to represent the hierarchical part-whole structure. In this work, we study grounded grammar induction of vision and language in a joint learning framework. Specifically, we present VLGrammar, a method that uses compound probabilistic context-free grammars (compound PCFGs) to induce the language grammar and the image grammar simultaneously. We propose a novel contrastive learning framework to guide the joint learning of both modules. To provide a benchmark for the grounded grammar induction task, we collect a large-scale dataset, PartIt, which contains human-written sentences that describe part-level semantics for 3D objects. Experiments on the PartIt dataset show that VLGrammar outperforms all baselines in image grammar induction and language grammar induction. The learned VLGrammar naturally benefits related downstream tasks. Specifically, it improves the image unsupervised clustering accuracy by 30%, and performs well in image retrieval and text retrieval. Notably, the induced grammar shows superior generalizability by easily generalizing to unseen categories.
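
For readers unfamiliar with PCFGs, the quantity that grammar induction typically maximizes is the marginal probability of a sentence summed over all of its parse trees, computed with the inside algorithm. The following is a minimal sketch of that algorithm for a plain PCFG in Chomsky normal form. It is not the VLGrammar implementation: the grammar symbols, the probabilities, and the function name `inside_probability` are invented for illustration, and a compound PCFG would additionally condition the rule probabilities on a per-sentence latent vector, which is omitted here.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form. Every symbol and probability
# below is a made-up illustration, not taken from the paper.
BINARY_RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 1.0,
    ("VP", ("V", "NP")): 1.0,
}
LEXICAL_RULES = {
    ("Det", "the"): 1.0,
    ("N", "chair"): 0.5,
    ("N", "leg"): 0.5,
    ("V", "has"): 1.0,
}


def inside_probability(words, root="S"):
    """Return P(words) under the toy PCFG, summed over all parses."""
    n = len(words)
    # chart[(i, j)][A] = probability that nonterminal A yields words[i:j]
    chart = defaultdict(lambda: defaultdict(float))

    # Width-1 spans: apply lexical rules A -> word.
    for i, word in enumerate(words):
        for (lhs, terminal), prob in LEXICAL_RULES.items():
            if terminal == word:
                chart[(i, i + 1)][lhs] += prob

    # Wider spans: combine two adjacent sub-spans with binary rules A -> B C.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for (lhs, (left_sym, right_sym)), prob in BINARY_RULES.items():
                    left = chart[(i, k)].get(left_sym, 0.0)
                    right = chart[(k, j)].get(right_sym, 0.0)
                    if left and right:
                        chart[(i, j)][lhs] += prob * left * right

    return chart[(0, n)].get(root, 0.0)


if __name__ == "__main__":
    print(inside_probability("the chair has the leg".split()))
```

Under this toy grammar the sentence "the chair has the leg" has a single parse, and a word sequence the grammar cannot derive receives probability zero. The same chart recursion applies to any sequence of terminals, so an image grammar over part tokens could be scored the same way; that reading is an assumption made here, not a detail stated in the abstract.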

Code Implementations: 1 repo
