CL Jul 28, 2023

'What are you referring to?' Evaluating the Ability of Multi-Modal Dialogue Models to Process Clarificational Exchanges

Javier Chiyah-Garcia, Alessandro Suglia, Arash Eshghi, Helen Hastie

arXiv:2307.15554v124.8193 citations h-index: 20 Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the problem of referential ambiguity in multi-modal dialogue systems, but the advance is incremental because it builds on existing datasets and models.

The study evaluated how well multi-modal dialogue models handle clarificational exchanges to resolve referential ambiguities, finding that language-based models excel with dialogue history-related clarifications while multi-modal models with disentangled object representations better manage complex cross-modal ambiguities.

Referential ambiguities arise in dialogue when a referring expression does not uniquely identify the intended referent for the addressee. Addressees usually detect such ambiguities immediately and work with the speaker to repair them using meta-communicative Clarificational Exchanges (CEs): a Clarification Request (CR) and a response. Here, we argue that the ability to generate and respond to CRs imposes specific constraints on the architecture and objective functions of multi-modal, visually grounded dialogue models. We use the SIMMC 2.0 dataset to evaluate the ability of different state-of-the-art model architectures to process CEs, with a metric that probes the contextual updates that arise from them in the model. We find that language-based models are able to encode simple multi-modal semantic information and process some CEs, excelling with those related to the dialogue history, whilst multi-modal models can use additional learning objectives to obtain disentangled object representations, which become crucial for handling complex referential ambiguities across modalities.
