Gap-weighted subsequences for automatic cognate identification and phylogenetic inference
This work addresses a domain-specific problem in linguistics for researchers studying language evolution and phylogenetics, with incremental improvements in feature design.
The paper tackles the problem of cognate identification by introducing subsequence-based features, which outperform state-of-the-art string similarity measures, and uses these cognate judgments for phylogenetic inference, resulting in a tree close to the gold standard.
In this paper, we describe the problem of cognate identification and its relation to phylogenetic inference. We introduce subsequence-based features for discriminating cognates from non-cognates. We show that these features outperform state-of-the-art string similarity measures at cognate identification. We then use the resulting cognate judgments for phylogenetic inference and observe that the classifiers infer a tree close to the gold-standard tree. The contributions of this paper are the use of subsequence features for cognate identification and the application of the resulting cognate judgments to phylogenetic inference.
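To make the idea of gap-weighted subsequence features concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes the standard gap-weighted formulation, in which each occurrence of a length-p subsequence contributes a decay factor raised to the length of the span it covers in the word. The function names, the subsequence length `p`, and the `decay` value are illustrative assumptions, and the paper's exact feature definition may differ.

```python
from collections import defaultdict
from itertools import combinations


def gap_weighted_subsequences(word, p=2, decay=0.5):
    """Map each length-p subsequence of `word` to its gap-weighted score.

    Hypothetical helper: every way of picking p positions contributes
    decay ** span, where span is the distance from the first to the last
    picked position (inclusive), so gappier occurrences weigh less.
    """
    features = defaultdict(float)
    for idx in combinations(range(len(word)), p):
        span = idx[-1] - idx[0] + 1
        subseq = "".join(word[i] for i in idx)
        features[subseq] += decay ** span
    return dict(features)


def common_subsequence_score(a, b, p=2, decay=0.5):
    """Dot product of the two words' feature vectors over shared subsequences."""
    fa = gap_weighted_subsequences(a, p, decay)
    fb = gap_weighted_subsequences(b, p, decay)
    return sum(fa[u] * fb[u] for u in fa.keys() & fb.keys())


if __name__ == "__main__":
    # Illustrative word pair only; not data from the paper.
    print(gap_weighted_subsequences("hund"))
    print(common_subsequence_score("hound", "hund"))
```

For a pair such as English "hound" and German "hund", shared subsequences like "hn" and "nd" receive non-zero weight even though the words differ in length. Scores of this kind could serve as input features to a cognate classifier. The brute-force enumeration here is only practical for short words and small `p`.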