CL · Oct 20, 2024

Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?

CMU
arXiv:2410.15512v2 · 16 citations · h-index: 27 · NAACL
Originality: Incremental advance
AI Analysis

This work addresses benchmark design and reasoning consistency for LLMs, but it is an incremental advance, since it builds on existing QA tasks.

The study examines reverse question answering (RQA), in which an LLM generates a question for a given answer. It finds that, compared with standard QA, LLMs are much less accurate in RQA for numerical answers but slightly more accurate for textual answers, and that their RQA errors are not due to knowledge gaps alone.

Question answering (QA), giving correct answers to questions, is a popular task, but we test reverse question answering (RQA): for an input answer, give a question with that answer. Past work tests QA and RQA separately, but we test them jointly, comparing their difficulty, aiding benchmark design, and checking reasoning consistency. We run 16 LLMs on QA and RQA with trivia questions/answers, revealing: 1) Versus QA, LLMs are much less accurate in RQA for numerical answers, but slightly more accurate in RQA for textual answers; 2) LLMs often answer their own invalid questions from RQA accurately in QA, so RQA errors are not from knowledge gaps alone; 3) RQA errors correlate with question difficulty and inversely correlate with answer frequencies in the Dolma corpus; and 4) LLMs struggle to provide valid multi-hop questions. By finding question and answer types that lead to RQA errors, we suggest improvements for LLM reasoning.
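The joint QA/RQA check in finding 2 can be sketched as a small bucketing loop: the model writes a question for a target answer, and if that question is invalid, the model is asked to answer its own question in QA. A correct self-answer suggests the RQA error is not a knowledge gap. This is a minimal sketch under stated assumptions, not the paper's code: `classify_rqa`, the `rqa`/`qa`/`gold` callables, the bucket names, and the toy data are all hypothetical, and exact string matching stands in for the answer judging a real evaluation would need.

```python
from collections import Counter
from typing import Callable, Iterable


def classify_rqa(
    answers: Iterable[str],
    rqa: Callable[[str], str],   # hypothetical: model writes a question for an answer
    qa: Callable[[str], str],    # hypothetical: model answers a question
    gold: Callable[[str], str],  # hypothetical: oracle giving a question's true answer
) -> Counter:
    """Bucket each RQA attempt by validity and QA self-consistency."""
    counts: Counter = Counter()
    for target in answers:
        question = rqa(target)
        true_answer = gold(question)
        if true_answer == target:
            # The generated question really does have the target as its answer.
            counts["valid"] += 1
        elif qa(question) == true_answer:
            # Invalid question, yet the model answers it correctly in QA,
            # so the RQA error is not explained by a knowledge gap alone.
            counts["invalid_but_self_answered"] += 1
        else:
            # Invalid question that the model also answers wrongly in QA.
            counts["invalid_and_self_missed"] += 1
    return counts


# Toy, made-up data (not from the paper): one valid question, one invalid
# question the model can answer itself, and one it cannot.
toy_rqa = {
    "Paris": "What is the capital of France?",
    "1912": "In what year did World War I begin?",
    "8848": "How tall is Mount Everest in feet?",
}
toy_gold = {
    "What is the capital of France?": "Paris",
    "In what year did World War I begin?": "1914",
    "How tall is Mount Everest in feet?": "29032",
}
toy_qa = {
    "What is the capital of France?": "Paris",
    "In what year did World War I begin?": "1914",
    "How tall is Mount Everest in feet?": "29000",
}

result = classify_rqa(toy_rqa, toy_rqa.__getitem__, toy_qa.__getitem__, toy_gold.__getitem__)
print(dict(result))
```

Passing plain callables keeps the loop independent of any particular model API; in practice `rqa` and `qa` would wrap LLM calls and `gold` would be a human or automatic judge.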

Code Implementations: 1 repo
