CL | Sep 27, 2024

Do LLMs suffer from Multi-Party Hangover? A Diagnostic Approach to Addressee Recognition and Response Selection in Conversations

Nicolò Penzo, Maryam Sajedinia, Bruno Lepri, Sara Tonelli, Marco Guerini

arXiv:2409.18602v1 | 14.425 citations | h-index: 5 | Has Code

Originality: Incremental advance

AI Analysis

This work helps researchers diagnose model weaknesses in multi-party conversations. It is rated incremental because it builds on existing evaluation methods, to which it adds a new diagnostic approach.

The paper addresses the challenge of evaluating models on Multi-Party Conversations by proposing a diagnostic pipeline that assesses performance on the Response Selection and Addressee Recognition tasks. It finds that response selection depends more on textual content, whereas addressee recognition relies on structural aspects. An LLM evaluated in a zero-shot setting shows sensitivity to prompt variations that differs by task.

Assessing the performance of systems to classify Multi-Party Conversations (MPC) is challenging due to the interconnection between linguistic and structural characteristics of conversations. Conventional evaluation methods often overlook variances in model behavior across different levels of structural complexity on interaction graphs. In this work, we propose a methodological pipeline to investigate model performance across specific structural attributes of conversations. As a proof of concept we focus on Response Selection and Addressee Recognition tasks, to diagnose model weaknesses. To this end, we extract representative diagnostic subdatasets with a fixed number of users and a good structural variety from a large and open corpus of online MPCs. We further frame our work in terms of data minimization, avoiding the use of original usernames to preserve privacy, and propose alternatives to using original text messages. Results show that response selection relies more on the textual content of conversations, while addressee recognition requires capturing their structural dimension. Using an LLM in a zero-shot setting, we further highlight how sensitivity to prompt variations is task-dependent.
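As a rough illustration of two ideas in the abstract (replacing original usernames with placeholders, and describing a conversation by structural attributes of its interaction graph), here is a minimal Python sketch. It is not the authors' pipeline: the function name `anonymize_and_profile`, the `[USERn]` placeholder format, the `(speaker, addressee, text)` message layout, and the choice of `num_users` and `max_degree` as structural attributes are all assumptions made for this example.

```python
from collections import defaultdict


def anonymize_and_profile(messages):
    """Hypothetical helper, not taken from the paper.

    messages: list of (speaker, addressee_or_None, text) tuples in
    chronological order. Returns the messages with usernames replaced by
    placeholders, plus a small dict of structural attributes.
    """
    alias = {}

    def pseudonym(name):
        # Assign placeholders in order of first appearance.
        if name not in alias:
            alias[name] = f"[USER{len(alias) + 1}]"
        return alias[name]

    anonymized = []
    neighbours = defaultdict(set)  # undirected interaction graph
    for speaker, addressee, text in messages:
        s = pseudonym(speaker)
        a = pseudonym(addressee) if addressee is not None else None
        anonymized.append((s, a, text))
        if a is not None and a != s:
            neighbours[s].add(a)
            neighbours[a].add(s)

    profile = {
        # Number of distinct participants in the conversation.
        "num_users": len(alias),
        # Largest number of distinct interlocutors any single user has.
        "max_degree": max((len(n) for n in neighbours.values()), default=0),
    }
    return anonymized, profile


# Toy conversation with invented usernames and messages.
conversation = [
    ("alice", None, "Anyone tried the new release?"),
    ("bob", "alice", "Yes, works fine."),
    ("carol", "alice", "Crashes for me."),
    ("alice", "carol", "Which OS?"),
]
anonymized, profile = anonymize_and_profile(conversation)
```

Conversations could then be grouped by `profile["num_users"]` to build subsets with a fixed number of users. The sketch only replaces the speaker and addressee fields; usernames mentioned inside message text are left untouched and would need separate handling.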
