AP · CL · May 10, 2024

Sampling the Swadesh List to Identify Similar Languages with Tree Spaces

arXiv:2405.06549v1 · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses language classification for linguistics researchers, but it appears incremental as it applies existing tree space methods to a specific dataset.

The paper tackles the problem of identifying language relationships by analyzing Swadesh list data using tree spaces and clustering, finding that sample means can be sticky or non-sticky, which helps infer common or different ancestry among languages.

Communication plays a vital role in human interaction. Studying language is a worthwhile task that has recently become more quantitative with the development of fields such as quantitative comparative linguistics and lexicostatistics. With respect to the authors' own native languages, the ancestry of the English language and the Latin alphabet are of primary interest. The Indo-European Tree traces many modern languages back to the Proto-Indo-European root. Swadesh's cognates played a large role in developing that historical perspective, in which some of the primary branches are Germanic, Celtic, Italic, and Balto-Slavic. This paper applies data analysis on open books, where the simplest singular space is the 3-spider (a union T3 of three rays with their endpoints glued at a point 0), which can represent these tree spaces for language clustering. The trees are built with single-linkage clustering based on distances between samples from languages that use the Latin script. Taking three languages at a time, the barycenter is determined. Initial results have found both non-sticky and sticky sample means. If the mean is non-sticky, then one language may come from a different ancestor than the other two. If the mean is sticky, then the languages may share a common ancestor, or all of them may have different ancestry.
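The barycenter step can be illustrated with a minimal Python sketch. It is not taken from the paper's repository. The function names `spider_point` and `spider_mean`, the leg convention (leg k means language k joins the single-linkage tree last), and the use of the gap between the two merge heights as the distance from the origin are all assumptions made for illustration. The mean uses the standard closed form for the Fréchet mean on a 3-spider: for each leg, fold the other two legs onto the negative half-line and average; the mean lies on a leg only if that folded average is positive, and otherwise sits at the origin.

```python
from collections import defaultdict


def spider_point(d):
    """Map a 3x3 symmetric distance matrix to a point (leg, radius) on T3.

    Assumed convention (hypothetical, for illustration only): the two
    closest languages merge first under single linkage; the leg index is
    the language that joins last, and the radius is the gap between the
    second and first merge heights.
    """
    # Key = index of the language left out of each pair.
    pairs = {2: d[0][1], 1: d[0][2], 0: d[1][2]}
    leg = min(pairs, key=pairs.get)          # language outside the closest pair
    first = pairs[leg]                       # height of the first merge
    # Single linkage: the third language joins at its smaller distance.
    second = min(v for k, v in pairs.items() if k != leg)
    return leg, second - first


def spider_mean(points, tol=1e-12):
    """Frechet mean of points (leg, radius) on the 3-spider.

    Returns (leg, radius, label). The label is "non-sticky" when the mean
    lies strictly inside a leg and "sticky" when it sits at the origin.
    The borderline case of a folded average equal to zero is lumped in
    with "sticky" here.
    """
    n = len(points)
    total = sum(r for _, r in points)
    leg_sum = defaultdict(float)
    for leg, r in points:
        leg_sum[leg] += r
    for leg in range(3):
        # Folded average: own-leg radii count positive, all others negative.
        m = (2 * leg_sum[leg] - total) / n
        if m > tol:
            return leg, m, "non-sticky"
    return None, 0.0, "sticky"


# Hypothetical samples: two trees favour leg 0, one favours leg 1.
print(spider_mean([(0, 3.0), (0, 1.0), (1, 1.0)]))
# One tree on each leg at equal distance pulls the mean to the origin.
print(spider_mean([(0, 1.0), (1, 1.0), (2, 1.0)]))
```

At most one leg can have a positive folded average, so the loop returns at the first one it finds. In this sketch a non-sticky result on leg k would correspond to language k being the outlier among the three, in line with the interpretation given above.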
