CL | Nov 13, 2023

Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models

arXiv:2311.07194v3 | 31 citations | h-index: 35
Originality: Incremental advance
AI Analysis

This work addresses the challenge of assessing dialogue comprehension in LLMs, which is crucial for improving their reliability in conversational AI applications. It is an incremental advance because it builds on existing evaluation methods.

The paper evaluates the factual consistency of dialogue comprehension in large language models (LLMs) using dialogue summarization and factual questions derived from the generated summaries. On average, 26.8% of LLM-generated summaries contain factual inconsistencies (16% for ChatGPT, the strongest model evaluated), and the average error rate on question answering is 36.1%. The paper also proposes a fine-tuning paradigm that achieves a relative error rate reduction of 11% on the question-answering task.

LLMs (Large Language Models) usually interact with users in the form of dialogue and generate responses following their instructions, which naturally requires dialogue comprehension abilities. However, dialogue comprehension is a general language ability that is hard to evaluate directly. In this work, we propose to perform the evaluation focusing on the factual consistency issue with the help of the dialogue summarization task. Besides evaluating and analyzing the dialogue summarization performance (DIAC-Sum) of different LLMs, we also derive factual questions from the generated summaries and use them as a more flexible measurement of dialogue comprehension (DIAC-QA). Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistencies. Even ChatGPT, the strongest model evaluated, has such errors in 16% of its summaries. For answering the factual questions, which is more challenging, the average error rate of all evaluated LLMs is 36.1%. Both results indicate serious deficiencies. Detailed analysis shows that understanding the subject/object of the conversation is still challenging for LLMs. Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data, which achieved a relative error rate reduction of 11% on DIAC-QA.
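
The error rates and the relative error rate reduction quoted above can be illustrated with a minimal Python sketch. It assumes each summary or answer has already been labelled as factually inconsistent or not; the function names and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def error_rate(has_error: Sequence[bool]) -> float:
    """Fraction of outputs labelled as factually inconsistent (True = error)."""
    if not has_error:
        raise ValueError("need at least one labelled output")
    return sum(has_error) / len(has_error)


def relative_error_reduction(baseline: float, improved: float) -> float:
    """Relative reduction: (baseline - improved) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


# Hypothetical labels for 10 answers before and after fine-tuning.
before = [True] * 4 + [False] * 6   # 4 of 10 answers wrong
after = [True] * 3 + [False] * 7    # 3 of 10 answers wrong

baseline_rate = error_rate(before)
improved_rate = error_rate(after)
reduction = relative_error_reduction(baseline_rate, improved_rate)
print(f"baseline={baseline_rate:.2f} improved={improved_rate:.2f} "
      f"relative reduction={reduction:.0%}")
```

A relative reduction is measured against the baseline error rate, so it is not the same as a drop in percentage points: in the toy data the error rate falls by 10 points, which is a 25% relative reduction.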

