Towards Human Cognition: Visual Context Guides Syntactic Priming in Fusion-Encoded Models
This work addresses the problem of understanding syntactic processing in MLLMs for computational linguistics and AI research, offering a new dataset and metric, but it is incremental in scope.
The paper examines whether multimodal large language models (MLLMs) exhibit structural priming effects similar to those observed in humans, and finds that fusion-encoded models show robust positive correlations between priming effects and visual similarity, aligning more closely with human cognitive patterns.
Structural priming is a cognitive phenomenon where exposure to a particular syntactic structure increases the likelihood of producing the same structure in subsequent utterances. While humans consistently demonstrate structural priming effects across various linguistic contexts, it remains unclear whether multimodal large language models (MLLMs) exhibit similar syntactic preservation behaviors. We introduce PRISMATIC, the first multimodal structural priming dataset, which advances computational linguistics by providing a standardized benchmark for investigating syntax-vision interactions. We propose the Syntactic Preservation Index (SPI), a novel reference-free evaluation metric designed specifically to assess structural priming effects at the sentence level. Using this metric, we construct and test models with two different multimodal encoding architectures to investigate their structural preservation capabilities. Our experimental results demonstrate that models with both encoding methods show comparable syntactic priming effects. However, only fusion-encoded models exhibit robust positive correlations between priming effects and visual similarity, suggesting a cognitive process more aligned with human psycholinguistic patterns. This work provides new insights into evaluating and understanding how syntactic information is processed in multimodal language models.
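To make the correlation analysis concrete, the following is a minimal sketch of how a per-item priming score could be related to visual similarity. It is an illustration under stated assumptions, not the authors' procedure: the abstract specifies neither the similarity measure nor the correlation coefficient, so cosine similarity over image embeddings and Pearson's r are assumed here. The embeddings and the `priming_scores` values are hypothetical toy numbers, and the scores are placeholders rather than SPI values, since the abstract does not define how SPI is computed.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (prime image, target image) embedding pairs; toy values only.
embedding_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]

# Hypothetical per-item priming scores (placeholders, not real SPI values).
priming_scores = [0.9, 0.6, 0.2]

# Assumed visual similarity measure: cosine similarity of the two embeddings.
visual_similarity = [cosine_similarity(p, t) for p, t in embedding_pairs]

r = pearson(visual_similarity, priming_scores)
print(f"Pearson r between visual similarity and priming score: {r:.3f}")
```

A positive `r` on real data would correspond to the pattern reported for fusion-encoded models, where priming is stronger when the prime and target images are more alike.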