Fast and unsupervised methods for multilingual cognate clustering
An online PMI system gives historical linguists a fast and accurate tool for identifying cognates in less-studied language families.
The paper tackled the problem of detecting cognates in multilingual word lists using unsupervised methods, and found that an online PMI system outperformed HMM-based and linguistically motivated systems across 16 language groups.
In this paper we explore unsupervised methods for detecting cognates in multilingual word lists. We use online EM to train sound segment similarity weights, which are then used to compute the similarity between two words. We tested our online systems on sixteen geographically dispersed language groups and show that the Online PMI (Pointwise Mutual Information) system outperforms an HMM-based system and two linguistically motivated systems, LexStat and ALINE. Our results suggest that a PMI system trained in an online fashion can be used by historical linguists for fast and accurate identification of cognates in less well-studied language families.
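
To make the idea of PMI-based sound segment weights concrete, the following is a minimal sketch of how a PMI score for a pair of segments could be estimated from already-aligned segment pairs. It is an illustration under stated assumptions, not the paper's implementation: the function name `pmi_scores`, the input format, and the plain relative-frequency estimates are all hypothetical, and the online EM training described above is not reproduced here.

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs):
    """Estimate PMI for each unordered segment pair.

    aligned_pairs: iterable of (a, b) tuples, each an aligned pair of
    sound segments taken from two words (hypothetical input format).
    Returns a dict mapping a sorted (a, b) tuple to log(p(a,b) / (p(a) p(b))).
    """
    pair_counts = Counter()
    seg_counts = Counter()
    for a, b in aligned_pairs:
        # Treat (a, b) and (b, a) as the same pair.
        pair_counts[tuple(sorted((a, b)))] += 1
        seg_counts[a] += 1
        seg_counts[b] += 1

    n_pairs = sum(pair_counts.values())
    n_segs = sum(seg_counts.values())

    scores = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs          # joint probability of the aligned pair
        p_a = seg_counts[a] / n_segs    # marginal probability of segment a
        p_b = seg_counts[b] / n_segs    # marginal probability of segment b
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores


# Toy data, invented for illustration only.
toy_alignments = [("p", "p"), ("p", "f"), ("t", "t"), ("t", "t")]
toy_scores = pmi_scores(toy_alignments)
```

In a sketch like this, a positive score means two segments are aligned more often than their individual frequencies would predict, which is the kind of evidence a word-similarity measure can accumulate over an alignment.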