Can GPT models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals
This paper addresses the problem of evaluating AI summarization for targeted communication goals, but its contribution is incremental, as it focuses on applying existing models to specific datasets.
This study investigated whether GPT models can generate dialogue summaries that follow human guidelines, finding that human evaluators preferred GPT-generated summaries over those from task-specific models and over the reference summaries, despite occasionally longer outputs and structural differences from the references.
This study investigates the ability of GPT models (ChatGPT, GPT-4, and GPT-4o) to generate dialogue summaries that adhere to human guidelines. We experimented with various prompts to guide the models in complying with the guidelines on two datasets: DialogSum (English social conversations) and DECODA (French call center interactions). Human evaluation, based on the summarization guidelines, served as the primary assessment method, complemented by extensive quantitative and qualitative analyses. Our findings reveal a preference for GPT-generated summaries over those from task-specific pre-trained models and over reference summaries, highlighting GPT models' ability to follow human guidelines despite occasionally producing longer outputs and diverging lexically and structurally from the references. The discrepancy between automatic metrics (ROUGE and BERTScore) and human evaluation underscores the need for more reliable automatic evaluation metrics.