CL · CV · Sep 20, 2023

The Scenario Refiner: Grounding subjects in images at the morphological level

arXiv:2309.11252v1 · 104 citations · h-index: 30
Originality: Synthesis-oriented
AI Analysis

This paper addresses model-human alignment in nuanced language understanding, a problem relevant to V&L researchers. The contribution is incremental, since it builds on existing methodologies.

The paper investigates whether Vision and Language models capture semantic distinctions at the morphological level, such as between 'runner' and 'running', and finds that model predictions differ from human judgments, showing a grammatical bias.

Derivationally related words, such as "runner" and "running", exhibit semantic differences which also elicit different visual scenarios. In this paper, we ask whether Vision and Language (V&L) models capture such distinctions at the morphological level, using a new methodology and dataset. We compare the results from V&L models to human judgements and find that models' predictions differ from those of human participants, in particular displaying a grammatical bias. We further investigate whether the human-model misalignment is related to model architecture. Our methodology, developed on one specific morphological contrast, can be further extended for testing models on capturing other nuanced language features.
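The comparison between model predictions and human judgements described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Item` record, the field names, the scoring scales, and the toy numbers are hypothetical and are not taken from the paper or its dataset. The sketch only shows one simple way to measure how often a model and human raters prefer the same caption type for an image, and how often each side prefers the verb form.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One image with scores for two derivationally related captions.

    All fields are hypothetical: e.g. model_noun could be an image-text
    similarity for a caption with "runner", model_verb for "running",
    and the human_* fields mean ratings from participants.
    """
    image_id: str
    model_noun: float
    model_verb: float
    human_noun: float
    human_verb: float


def preferred(noun_score: float, verb_score: float) -> str:
    """Return which caption type scores higher (ties count as 'verb')."""
    return "noun" if noun_score > verb_score else "verb"


def agreement_rate(items: list[Item]) -> float:
    """Fraction of items where model and humans prefer the same caption type."""
    matches = sum(
        preferred(i.model_noun, i.model_verb) == preferred(i.human_noun, i.human_verb)
        for i in items
    )
    return matches / len(items)


def verb_preference_rate(items: list[Item], source: str) -> float:
    """Fraction of items where the given source ('model' or 'human') prefers the verb."""
    if source not in ("model", "human"):
        raise ValueError("source must be 'model' or 'human'")
    count = 0
    for i in items:
        noun = i.model_noun if source == "model" else i.human_noun
        verb = i.model_verb if source == "model" else i.human_verb
        count += preferred(noun, verb) == "verb"
    return count / len(items)


# Made-up toy data; these numbers do not reflect any result from the paper.
ITEMS = [
    Item("img1", 0.31, 0.35, 4.5, 2.0),
    Item("img2", 0.28, 0.33, 1.5, 4.0),
    Item("img3", 0.40, 0.30, 4.0, 3.0),
]

if __name__ == "__main__":
    print("agreement:", agreement_rate(ITEMS))
    print("model verb preference:", verb_preference_rate(ITEMS, "model"))
    print("human verb preference:", verb_preference_rate(ITEMS, "human"))
```

A gap between the two verb-preference rates would be one crude signal of a systematic preference for one grammatical form. The paper's actual measure of grammatical bias may be defined differently.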

