CL LG · Oct 17, 2023

Alexpaca: Learning Factual Clarification Question Generation Without Examples

Matthew Toles, Yukun Huang, Zhou Yu, Luis Gravano

arXiv:2310.11571v31.32 citations · h-index: 51

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of eliciting missing factual information in real-life collaborative tasks such as legal or technical advice. It is rated incremental because it builds on existing clarification question generation work.

The paper tackles the problem of generating factual clarification questions in multi-hop reasoning tasks, introducing a new benchmark called HotpotQA-FLM for automatic evaluation, and shows that fine-tuning Llama 3 8B Instruct with filtered self-generations improves information recovery by 27.6%.

Real-life tasks such as giving legal or technical advice often lack complete context at the outset and can have disparate answers depending on that context. The ability to derive missing factual information by asking clarifying questions (ACQ) is an important element of real-life collaboration on such reasoning tasks. Existing factual clarification question challenges evaluate generations based on word overlap or human evaluations. Recent work explores generating a response to the clarifying question and then evaluating its utility directly. So far, these tasks are limited to disambiguating the user's intent rather than concrete facts about the situation. The factual domain presents unique challenges, since responses to clarification questions must be factually true for accurate evaluation. To enable evaluation of factual-domain clarification question generation, we present a new task that focuses on the ability to elicit missing information in multi-hop reasoning tasks. The task, HotpotQA-FLM, can be evaluated automatically, making it convenient for benchmarking language models. We observe that humans outperform GPT-4 by a large margin, while Llama 3 8B Instruct does not even beat the dummy baseline in some metrics. Finally, we find that by fine-tuning Llama 3 8B Instruct on its own generations, filtered via rejection sampling, we can improve information recovery by 27.6 percent.
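The abstract above mentions filtering a model's own generations via rejection sampling before fine-tuning on them. The following is a minimal sketch of that general idea, not the authors' implementation: the function names (`rejection_sample`, `toy_generate`, `toy_score`), the number of samples per prompt, and the acceptance threshold are all hypothetical, and the scoring function stands in for whatever information-recovery measure the paper actually uses.

```python
from typing import Callable, List, Tuple


def rejection_sample(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    num_samples: int = 4,      # assumed value, not taken from the paper
    threshold: float = 0.5,    # assumed value, not taken from the paper
) -> List[Tuple[str, str]]:
    """Keep (prompt, question) pairs whose score is at least `threshold`."""
    kept = []
    for prompt in prompts:
        # Sample several candidate clarifying questions for this prompt.
        for question in generate(prompt, num_samples):
            # Reject candidates that score below the threshold.
            if score(prompt, question) >= threshold:
                kept.append((prompt, question))
    return kept


# Toy stand-ins so the sketch runs without a language model.
def toy_generate(prompt: str, n: int) -> List[str]:
    return [f"{prompt} candidate {i}" for i in range(n)]


def toy_score(prompt: str, question: str) -> float:
    # Pretend that even-numbered candidates recover the missing fact.
    index = int(question.rsplit(" ", 1)[1])
    return 1.0 if index % 2 == 0 else 0.0


dataset = rejection_sample(["ctx A", "ctx B"], toy_generate, toy_score)
```

In a real pipeline, `generate` would sample from the model being trained and the surviving pairs in `dataset` would serve as the fine-tuning set; the toy functions here exist only to make the filtering step concrete.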
