Gold Standard Online Debates Summaries and First Experiments Towards Automatic Summarization of Online Debate Data
This work addresses the problem of limited resources for researchers in natural language processing focusing on online debate summarization, though it is incremental as it builds on existing extractive summarization methods.
The authors address the lack of annotated data for online debate summarization by collecting and annotating debate data, reporting inter-annotator agreement of 36% for Cohen's kappa and 48% for Krippendorff's alpha. They also implement an extractive summarization system and discuss key features for this task.
Usage of online textual media is steadily increasing. Every day, more news stories, blog posts and scientific articles are added online. These are all freely accessible and have been employed extensively in multiple research areas, e.g. automatic text summarization, information retrieval and information extraction. Meanwhile, online debate forums have recently become popular but remain largely unexplored. For this reason, there are insufficient resources of annotated debate data available for conducting research in this genre. In this paper, we collect and annotate debate data for an automatic summarization task. As in extractive gold standard summary generation, our data contains sentences worth including in a summary. Five human annotators performed this task. Inter-annotator agreement, based on semantic similarity, is 36% for Cohen's kappa and 48% for Krippendorff's alpha. Moreover, we implement an extractive summarization system for online debates and discuss prominent features for the task of summarizing online debate data automatically.
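To make the agreement figures concrete, the following is a minimal sketch of how Cohen's kappa can be computed for two annotators who each mark sentences as summary-worthy (1) or not (0). It assumes exact label matching on toy data; the agreement reported above is based on semantic similarity between selected sentences, which this sketch does not model. The function name `cohen_kappa` and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative only: agreement here means identical labels, not the
    semantic-similarity matching described in the abstract.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal frequencies, summed over all labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: 1 = sentence selected for the summary, 0 = not selected.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.583
```

Cohen's kappa is defined for a pair of annotators, so with five annotators it would typically be averaged over annotator pairs, whereas Krippendorff's alpha handles any number of annotators directly. How the pairs were combined in this work is not stated in the abstract.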