BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models
This work addresses the need for up-to-date evaluation of discourse understanding in LLMs, which matters to both researchers and developers. Its contribution is incremental, since it mainly compiles existing tasks and adds a few novel challenges.
The authors address the evaluation of discourse-level knowledge in modern large language models (LLMs) by introducing BeDiscovER, a comprehensive benchmark of 52 datasets across 5 tasks. They find that state-of-the-art models perform well on the arithmetic aspects of temporal reasoning but struggle with full-document reasoning and with subtle semantic and discourse phenomena such as rhetorical relation recognition.
We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks at the discourse lexicon, (multi-)sentential, and document levels, totaling 52 individual datasets. It covers extensively studied tasks such as discourse parsing and temporal relation extraction, novel challenges such as discourse particle disambiguation (e.g., ``just''), and a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER. We find that state-of-the-art models perform strongly on the arithmetic aspects of temporal reasoning, but struggle with full-document reasoning and with some subtle semantic and discourse phenomena, such as rhetorical relation recognition.