CL | Jan 26, 2024

Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs

Nan Hu, Jiaoyan Chen, Yike Wu, Guilin Qi, Hongru Wang, Sheng Bi, Yongrui Chen, Tongtong Wu, Jeff Z. Pan

arXiv:2401.14640v29.115 citations | Has Code | ACL

Originality: Incremental advance

AI Analysis

This work gives researchers an automated benchmark for evaluating complex attributions in question answering, reducing reliance on manual annotation and enabling fine-grained comparisons.

The paper addresses limitations in evaluating attributions in Attributed Question Answering (AQA) by introducing Complex Attributed Question Answering (CAQA), a large-scale benchmark with comprehensive attribution categories and complex attribution scenarios, generated automatically using Knowledge Graphs. It verifies the benchmark's effectiveness through experiments that benchmark 25 automatic evaluators and compare them with human evaluators.

Attributed Question Answering (AQA) has attracted wide attention, but the evaluation of attributions still has several limitations: it lacks fine-grained attribution categories, relies on manual annotations, and fails to compare attributions that differ only subtly. To bridge these gaps, we introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark, automatically generated using Knowledge Graphs (KGs), that contains comprehensive attribution categories and complex attribution scenarios. We have conducted extensive experiments to verify the effectiveness of CAQA, including benchmarking 25 automatic evaluators, comparing them with human evaluators, and testing LLM evaluators fine-tuned on CAQA. These experiments also yield a series of important findings that can benefit future research on AQA. All code and data are publicly accessible at https://github.com/HuuuNan/CAQA-Benchmark.
