Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs
This work addresses the problem of evaluating complex attributions in question answering, providing researchers with an automated benchmark that reduces reliance on manual annotation and enables fine-grained comparisons.
The paper tackles limitations in evaluating attributions in Attributed Question Answering by introducing Complex Attributed Question Answering (CAQA), a large-scale benchmark with comprehensive attribution categories and complex attribution scenarios, automatically generated using Knowledge Graphs. It verifies the benchmark's effectiveness through experiments that benchmark 25 automatic evaluators and compare them with human evaluators.
Attributed Question Answering (AQA) has attracted wide attention, but the evaluation of attributions still has several limitations: a lack of fine-grained attribution categories, reliance on manual annotations, and an inability to compare attributions that differ only subtly. To bridge these gaps, we introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark that is automatically generated using Knowledge Graphs (KGs) and contains comprehensive attribution categories and complex attribution scenarios. We have conducted extensive experiments to verify the effectiveness of CAQA, including benchmarking 25 automatic evaluators, comparing them with human evaluators, and testing LLM evaluators fine-tuned on CAQA. These experiments also yield a series of important findings that can benefit future research on AQA. All code and data are publicly accessible at https://github.com/HuuuNan/CAQA-Benchmark.