CL · Apr 4, 2024

Evaluating Generative Language Models in Information Extraction as Subjective Question Correction

Yuchen Fan, Yantao Liu, Zijun Yao, Jifan Yu, Lei Hou, Juanzi Li

arXiv:2404.03532v124.284 citations · h-index: 30 · Has Code · LREC

Originality: Incremental advance

AI Analysis

This work addresses evaluation challenges facing NLP researchers and practitioners by offering a more accurate assessment of LLMs in information extraction. It is incremental, however, because it refines existing evaluation approaches rather than introducing a new task or model.

The paper tackles the evaluation of large language models (LLMs) on information extraction tasks, where conventional metrics underestimate performance because they match outputs imprecisely and because benchmarks are incompletely annotated. It proposes SQC-Score, an evaluation method that uses a fine-tuned LLM to improve matching between outputs and gold labels and an NLI model to enrich those labels. On three tasks, human annotators preferred SQC-Score over the baseline metrics.

Modern Large Language Models (LLMs) have showcased remarkable prowess in various tasks necessitating sophisticated cognitive behaviors. Nevertheless, a paradoxical performance discrepancy is observed, where these models underperform in seemingly elementary tasks like relation extraction and event extraction due to two issues in conventional evaluation: (1) the imprecision of existing evaluation metrics, which struggle to effectively gauge semantic consistency between model outputs and ground truth, and (2) the inherent incompleteness of evaluation benchmarks, primarily due to restrictive human annotation schemas, resulting in underestimated LLM performance. Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score. This method uses LLMs, fine-tuned on subjective question correction data, to refine the matching between model outputs and golden labels. Additionally, by incorporating a Natural Language Inference (NLI) model, SQC-Score enriches golden labels, addressing benchmark incompleteness by acknowledging correct yet previously omitted answers. Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics. Using SQC-Score, we conduct a comprehensive evaluation of state-of-the-art LLMs and provide insights for future research on information extraction. The dataset and associated code can be accessed at https://github.com/THU-KEG/SQC-Score.
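The two ideas in the abstract, graded matching and gold-label enrichment, can be illustrated with a small sketch. This is not the SQC-Score implementation from the repository above. The names `toy_match`, `toy_entails`, `enriched_f1`, and the `threshold` parameter are hypothetical, and the two toy functions are simple stand-ins for the fine-tuned LLM matcher and the NLI model described in the paper.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def toy_match(pred: Triple, gold: Triple) -> float:
    """Hypothetical stand-in for a learned matcher.

    Returns the fraction of triple fields that agree, ignoring case.
    """
    same = sum(a.lower() == b.lower() for a, b in zip(pred, gold))
    return same / 3


def toy_entails(source: str, triple: Triple) -> bool:
    """Hypothetical stand-in for an NLI model.

    Accepts a triple when its head and tail both appear in the source text.
    """
    text = source.lower()
    return triple[0].lower() in text and triple[2].lower() in text


def enriched_f1(
    preds: List[Triple],
    golds: List[Triple],
    source: str,
    match: Callable[[Triple, Triple], float] = toy_match,
    entails: Callable[[str, Triple], bool] = toy_entails,
    threshold: float = 0.5,  # assumed cut-off for "no gold label matches"
) -> float:
    # Step 1: enrich the gold set with predictions that match no gold label
    # well but are supported by the source text.
    enriched = list(golds)
    for p in preds:
        best = max((match(p, g) for g in golds), default=0.0)
        if best < threshold and entails(source, p) and p not in enriched:
            enriched.append(p)

    if not preds or not enriched:
        return 0.0

    # Step 2: graded precision and recall, using the best match score
    # instead of exact string equality.
    precision = sum(max(match(p, g) for g in enriched) for p in preds) / len(preds)
    recall = sum(max(match(p, g) for p in preds) for g in enriched) / len(enriched)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With one gold triple `("Marie Curie", "born_in", "Warsaw")` and a second, correct but unannotated prediction `("Marie Curie", "worked_in", "Paris")`, the sketch returns about 0.8 when enrichment is disabled and 1.0 when the toy entailment check admits the extra triple, which mirrors the underestimation the abstract describes.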
