CL | AI | Oct 2, 2025

MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization

Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, Saab Mansour

arXiv:2510.01659v1 | 9.63 citations | h-index: 16 | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable evaluation in Multimodal Dialogue Summarization (MDS), a task with wide-ranging applications. The advance is incremental, however, because it adapts existing evaluation concepts to a multimodal domain rather than introducing new ones.

The authors tackled the lack of robust automatic evaluation methods for Multimodal Dialogue Summarization (MDS) by introducing MDSEval, the first meta-evaluation benchmark with human annotations across eight quality aspects, and benchmarked state-of-the-art methods to reveal their limitations and biases.

Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art multimodal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.
