CL · Oct 19, 2025

DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking

arXiv:2510.17013v36.72 citations · h-index: 5

Originality: Synthesis-oriented

AI Analysis

This work addresses the need of NLP researchers and developers for more challenging, multilingual benchmarks, though it is incremental in that it builds on existing benchmarking efforts.

The authors tackled the lack of multilingual benchmarks for discourse tracking in LLMs by introducing DiscoTrack, a benchmark across 12 languages and four discourse levels, and found that state-of-the-art models still struggle with these tasks.

Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often targeting information from individual sentences. We are still lacking more challenging, and importantly also multilingual, benchmarks focusing on implicit information and pragmatic inferences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark targeting a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.
