CL | Jan 30, 2023

Using n-aksaras to model Sanskrit and Sanskrit-adjacent texts

arXiv:2301.12969v1 | h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses a specific problem in computational linguistics, the analysis of Sanskrit and related texts, and offers an incremental improvement over existing methods.

The paper tackles the difficulty of applying n-gram models to Sanskrit texts, where splitting a phrase into words requires resolving sandhi and breaking up long compounds. It proposes n-aksaras, contiguous sequences of aksaras, as a simpler tokenization that reduces the need for sandhi resolution and can also be applied to Sanskrit-adjacent texts. As a test case, the commentaries on Amarakosa 1.0.1 are modelled as n-aksaras, showing patterns of text reuse across ten centuries and nine languages.

Despite -- or perhaps because of -- their simplicity, n-grams, or contiguous sequences of tokens, have been used with great success in computational linguistics since their introduction in the late 20th century. Recast as k-mers, or contiguous sequences of monomers, they have also found applications in computational biology. When applied to the analysis of texts, n-grams usually take the form of sequences of words. But if we try to apply this model to the analysis of Sanskrit texts, we are faced with the arduous task of, firstly, resolving sandhi to split a phrase into words, and, secondly, splitting long compounds into their components. This paper presents a simpler method of tokenizing a Sanskrit text for n-grams, by using n-aksaras, or contiguous sequences of aksaras. This model reduces the need for sandhi resolution, making it much easier to use on raw text. It is also possible to use this model on Sanskrit-adjacent texts, e.g., a Tamil commentary on a Sanskrit text. As a test case, the commentaries on Amarakosa 1.0.1 have been modelled as n-aksaras, showing patterns of text reuse across ten centuries and nine languages. Some initial observations are made concerning Buddhist commentarial practices.
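The n-aksara idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the input is IAST-romanized Sanskrit, treats an aksara as any run of consonants followed by one vowel and an optional anusvara or visarga, and attaches leftover final consonants to the last aksara. The function names `aksaras` and `n_aksaras` are hypothetical.

```python
import re
import unicodedata
from collections import Counter


def _nfc(s):
    # Use precomposed forms so each IAST letter is a single code point.
    return unicodedata.normalize("NFC", s)


# Assumed IAST inventory; other transliteration schemes would need changes.
VOWEL_CHARS = _nfc("aāiīuūṛṝḷḹeo")
DIACRITIC_LETTERS = _nfc("āīūṛṝḷḹṅñṭḍṇśṣṃḥ")
FINALS = _nfc("ṃḥ")  # anusvara and visarga close an aksara

NON_LETTERS = re.compile(rf"[^a-z{DIACRITIC_LETTERS}]")
# Zero or more non-vowel letters, one vowel (diphthongs first), optional final.
AKSARA = re.compile(rf"[^{VOWEL_CHARS}]*(?:ai|au|[{VOWEL_CHARS}])[{FINALS}]?")


def aksaras(text):
    """Split romanized text into aksaras, ignoring spaces and punctuation."""
    cleaned = NON_LETTERS.sub("", _nfc(text.lower()))
    parts = AKSARA.findall(cleaned)
    # Consonants left after the last vowel (e.g. the final "t" of "tat").
    tail = cleaned[sum(len(p) for p in parts):]
    if tail:
        if parts:
            parts[-1] += tail
        else:
            parts = [tail]
    return parts


def n_aksaras(text, n):
    """Return every contiguous sequence of n aksaras as a tuple."""
    units = aksaras(text)
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


if __name__ == "__main__":
    line = "dharmakṣetre kurukṣetre"
    print(aksaras(line))
    # ['dha', 'rma', 'kṣe', 'tre', 'ku', 'ru', 'kṣe', 'tre']
    print(Counter(n_aksaras(line, 2)).most_common(1))
```

Because word boundaries are discarded before splitting, "kuru kṣetre" and "kurukṣetre" produce the same aksaras, so two texts can be compared without first deciding where words begin and end. Intersecting the n-aksara sets of two texts is one simple way to surface candidate passages of reuse.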

