CL, Sep 15, 2023

Investigating Answerability of LLMs for Long-Form Question Answering

Meghana Moorthy Bhat, Rui Meng, Ye Liu, Yingbo Zhou, Semih Yavuz

arXiv:2309.08210v13.915 citations, h-index: 23, Has Code

Originality: Incremental advance

AI Analysis

This work addresses the understudied challenge of long-form question answering for practical applications like troubleshooting and customer service, but it is incremental as it builds on existing evaluation methods.

The paper tackles the evaluation of LLMs on long-form question answering by proposing a method that generates challenging questions from abstractive summaries. It reveals performance gaps between massive LLMs such as ChatGPT and open-source models such as Alpaca and Llama: the open-source models rely less on context for questions generated from the original document, and their generation quality drops significantly on questions generated from summaries, especially for longer contexts (>1024 tokens).

As we embark on a new era of LLMs, it becomes increasingly crucial to understand their capabilities, limitations, and differences. Toward making further progress in this direction, we strive to build a deeper understanding of the gaps between massive LLMs (e.g., ChatGPT) and smaller yet effective open-source LLMs and their distilled counterparts. To this end, we specifically focus on long-form question answering (LFQA) because it has several practical and impactful applications (e.g., troubleshooting, customer service) yet is still understudied and challenging for LLMs. We propose a question-generation method from abstractive summaries and show that generating follow-up questions from summaries of long documents can create a challenging setting for LLMs to reason and infer from long contexts. Our experimental results confirm that: (1) our proposed method of generating questions from abstractive summaries poses a challenging setup for LLMs and shows performance gaps between LLMs like ChatGPT and open-source LLMs (Alpaca, Llama); and (2) open-source LLMs exhibit decreased reliance on context for generated questions from the original document, but their generation capabilities drop significantly on generated questions from summaries, especially for longer contexts (>1024 tokens).
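
The question-generation setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the `LLM` callable, the prompt wording, the function names, and the whitespace-based length proxy are all assumptions made for illustration. Only the overall flow (summarize a long document, generate follow-up questions from the summary, answer against the full document, compare contexts above and below 1024 tokens) comes from the abstract.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def build_qa_examples(document: str, llm: LLM, n_questions: int = 3) -> List[Dict[str, str]]:
    """Summarize the document, then ask for follow-up questions on the summary."""
    # Step 1: abstractive summary of the long document (prompt wording is assumed).
    summary = llm(f"Summarize the following document:\n\n{document}")
    # Step 2: follow-up questions generated from the summary, one per line.
    raw = llm(
        f"Write {n_questions} follow-up questions, one per line, that require "
        f"reasoning beyond this summary:\n\n{summary}"
    )
    questions = [q.strip() for q in raw.splitlines() if q.strip()][:n_questions]
    # Step 3: pair each question with the full document as answering context.
    return [{"question": q, "context": document, "summary": summary} for q in questions]


def answer(example: Dict[str, str], llm: LLM) -> str:
    """Answer a generated question using the original document as context."""
    return llm(f"Context:\n{example['context']}\n\nQuestion: {example['question']}\nAnswer:")


def length_bucket(context: str, threshold: int = 1024) -> str:
    """Bucket a context as 'long' or 'short'.

    Whitespace splitting is a crude stand-in for a real tokenizer (assumption).
    """
    return "long" if len(context.split()) > threshold else "short"
```

Passing the model as a plain callable keeps the sketch independent of any particular API, so the same pipeline could in principle be run with a large hosted model and with an open-source one to compare answers per length bucket.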
