CL, LG, SI | Jun 30, 2023

Multi-Dialectal Representation Learning of Sinitic Phonology

arXiv:2307.01209v1222 citations | h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses two tasks in Sinitic historical phonology, comparing dialects and reconstructing proto-languages, and offers linguists a tool for both. Its contribution is incremental: it applies existing methods to a new domain.

The paper represents Sinitic phonology across multiple dialects by constructing a knowledge graph and applying BoxE to learn syllable representations. The learned representations capture phonemic contrasts and support inference of Middle Chinese labels, which suggests they could be used to complete fragmented knowledge bases.

Machine learning techniques have shown their competence for representing and reasoning in symbolic systems such as language and phonology. In Sinitic Historical Phonology, notable tasks that could benefit from machine learning include the comparison of dialects and the reconstruction of proto-language systems. Motivated by this, this paper provides an approach for obtaining multi-dialectal representations of Sinitic syllables by constructing a knowledge graph from structured phonological data and then applying the BoxE technique from knowledge base learning. We applied unsupervised clustering techniques to the obtained representations and observed that they capture phonemic contrasts from the input dialects. Furthermore, we trained classifiers to infer unobserved Middle Chinese labels, showing the representations' potential for indicating archaic, proto-language features. The representations can be used for completing fragmented Sinitic phonological knowledge bases, estimating divergences between different characters, or aiding the exploration and reconstruction of archaic features.
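The first step in the abstract, turning structured phonological data into a knowledge graph, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the record layout, the placeholder names (`char_A`, `dialect_1`, `T1`), and the `dialect:slot` relation naming are hypothetical and are not taken from the paper, whose actual schema is not given here. The sketch only shows the general idea of emitting (head, relation, tail) triples and integer indices, which is the input form that knowledge graph embedding models such as BoxE typically consume.

```python
# Toy, made-up records: (character id, dialect, initial, final, tone).
# All values are placeholders, not real dialect data.
RECORDS = [
    ("char_A", "dialect_1", "p", "a", "T1"),
    ("char_A", "dialect_2", "p", "at", "T3"),
    ("char_B", "dialect_1", "p", "a", "T1"),
    ("char_B", "dialect_2", "b", "a", "T2"),
]


def build_triples(records):
    """Emit one (head, relation, tail) triple per phonological slot."""
    triples = set()
    for char, dialect, initial, final, tone in records:
        slots = (("initial", initial), ("final", final), ("tone", tone))
        for slot, value in slots:
            # Hypothetical naming: one relation per dialect and slot.
            relation = f"{dialect}:{slot}"
            triples.add((char, relation, f"{slot}={value}"))
    return sorted(triples)


def index_vocab(triples):
    """Map entity and relation names to consecutive integer ids."""
    entities = sorted({h for h, _, _ in triples} | {t for _, _, t in triples})
    relations = sorted({r for _, r, _ in triples})
    entity_ids = {name: i for i, name in enumerate(entities)}
    relation_ids = {name: i for i, name in enumerate(relations)}
    return entity_ids, relation_ids


triples = build_triples(RECORDS)
entity_ids, relation_ids = index_vocab(triples)

# Integer-encoded triples, ready to hand to an embedding model.
encoded = [(entity_ids[h], relation_ids[r], entity_ids[t]) for h, r, t in triples]
```

In this toy graph, `char_A` and `char_B` share all `dialect_1` tails but differ in `dialect_2`, which is the kind of cross-dialect contrast a multi-dialectal representation would need to reflect.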
