DialogSum Challenge: Results of the Dialogue Summarization Shared Task
This work highlights the difficulty of summarizing real-life dialogues and the need for better evaluation metrics, but it is incremental as it builds on existing shared task frameworks.
The DialogSum Challenge tackled dialogue summarization by evaluating four teams' systems, showing significant improvements in automatic metrics like ROUGE scores over baselines but revealing a notable gap between model outputs and human summaries in human evaluation.
We report the results of the DialogSum Challenge, the shared task on summarizing real-life scenario dialogues at INLG 2022. Four teams participated in this shared task and three submitted system reports, exploring different methods to improve the performance of dialogue summarization. Although the systems improve greatly over the baseline models on automatic evaluation metrics such as ROUGE scores, human evaluation from multiple aspects reveals a salient gap between model-generated outputs and human-annotated summaries. These findings demonstrate the difficulty of dialogue summarization and suggest that more fine-grained evaluation metrics are needed.