CL · LG · May 15, 2023

Semantic Composition in Visually Grounded Language Models

arXiv:2305.16328v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses the challenge of enabling AI models to understand and generate language grounded in visual contexts. It is rated an incremental advance because it builds on existing vision-language research, contributing new techniques.

The paper tackles semantic composition in visually grounded language models, which often fail to represent compositional structure. It introduces new evaluation tools, such as the WinogroundVQA benchmark and the Syntactic Neural Module Distillation measure, alongside methods intended to improve compositional ability, such as Syntactic MeanPool and Cross-modal Attention Congruence Regularization.

What is sentence meaning and its ideal representation? Much of the expressive power of human language derives from semantic composition, the mind's ability to represent meaning hierarchically & relationally over constituents. At the same time, much sentential meaning is outside the text and requires grounding in sensory, motor, and experiential modalities to be adequately learned. Although large language models display considerable compositional ability, recent work shows that visually-grounded language models drastically fail to represent compositional structure. In this thesis, we explore whether & how models compose visually grounded semantics, and how we might improve their ability to do so. Specifically, we introduce 1) WinogroundVQA, a new compositional visual question answering benchmark, 2) Syntactic Neural Module Distillation, a measure of compositional ability in sentence embedding models, 3) Causal Tracing for Image Captioning Models to locate neural representations vital for vision-language composition, 4) Syntactic MeanPool to inject a compositional inductive bias into sentence embeddings, and 5) Cross-modal Attention Congruence Regularization, a self-supervised objective function for vision-language relation alignment. We close by discussing connections of our work to neuroscience, psycholinguistics, formal semantics, and philosophy.

