CL · Sep 24, 2025

Mahānāma: A Unique Testbed for Literary Entity Discovery and Linking

Sujoy Sarkar, Gourav Sarkar, Manoj Balaji Jagadeeshan, Jivnesh Sandhan, Amrith Krishna, Pawan Goyal

arXiv:2509.19844v1 · h-index: 8

Originality: Synthesis-oriented

AI Analysis

Mahānāma provides a benchmark for advancing entity resolution in literary domains, particularly for under-resourced languages such as Sanskrit, though the contribution is incremental because it focuses on dataset creation.

The authors address entity resolution in literary texts by creating Mahānāma, the first large-scale dataset for Entity Discovery and Linking in Sanskrit. Derived from the Mahābhārata, it contains over 109K mentions of 5.5K unique entities. Their evaluation shows that current coreference and entity linking models struggle when evaluated on the global context of the test set.

High lexical variation, ambiguous references, and long-range dependencies make entity resolution in literary texts particularly challenging. We present Mahānāma, the first large-scale dataset for end-to-end Entity Discovery and Linking (EDL) in Sanskrit, a morphologically rich and under-resourced language. Derived from the Mahābhārata, the world's longest epic, the dataset comprises over 109K named entity mentions mapped to 5.5K unique entities, and is aligned with an English knowledge base to support cross-lingual linking. The complex narrative structure of Mahānāma, coupled with extensive name variation and ambiguity, poses significant challenges to resolution systems. Our evaluation reveals that current coreference and entity linking models struggle when evaluated on the global context of the test set. These results highlight the limitations of current approaches in resolving entities within such complex discourse. Mahānāma thus provides a unique benchmark for advancing entity resolution, especially in literary domains.
