Clustering Comparable Corpora of Russian and Ukrainian Academic Texts: Word Embeddings and Semantic Fingerprints
This work addresses language-independent document clustering of Russian and Ukrainian academic texts and reports improvements over both simple baselines and previously proposed multilingual clustering algorithms.
The authors cluster a bilingual comparable corpus of Russian and Ukrainian academic texts using word embeddings and semantic fingerprints. The method outperforms baselines such as orthographic translation by a large margin and requires fewer linguistic resources than previously proposed multilingual clustering algorithms.
We present our experience in applying distributional semantics (neural word embeddings) to the problem of representing and clustering documents in a bilingual comparable corpus. Our data is a collection of Russian and Ukrainian academic texts whose topics are their academic fields. To build language-independent semantic representations of these documents, we train neural distributional models on monolingual corpora and learn the optimal linear transformation of vectors from one language to the other. The resulting vectors are then used to produce 'semantic fingerprints' of documents, which serve as input to a clustering algorithm. The presented method is compared to several baselines, including 'orthographic translation' with Levenshtein edit distance, and outperforms them by a large margin. We also show that language-independent 'semantic fingerprints' are superior to the multilingual clustering algorithms proposed in previous work, while requiring fewer linguistic resources.
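The pipeline described above (monolingual vectors, a linear mapping between languages, document fingerprints) can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: the linear transformation is fitted by ordinary least squares on a seed dictionary of word pairs, and a document's fingerprint is taken to be the length-normalised average of its word vectors. The function names `learn_translation_matrix` and `fingerprint` are hypothetical, and the toy vectors are random stand-ins for trained embeddings.

```python
import numpy as np


def learn_translation_matrix(src_vecs, tgt_vecs):
    """Fit W minimising ||src_vecs @ W - tgt_vecs||^2 (assumed least-squares fit).

    Row i of src_vecs and row i of tgt_vecs are the embeddings of a
    translation pair from a seed dictionary.
    """
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W


def fingerprint(tokens, embeddings, W=None):
    """Assumed fingerprint: unit-length mean of the known word vectors.

    If W is given, the mean is projected into the target-language space.
    Returns None when no token has an embedding.
    """
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    mean = np.mean(vecs, axis=0)
    if W is not None:
        mean = mean @ W
    return mean / np.linalg.norm(mean)


# Toy data: Russian vectors are an exact linear image of Ukrainian ones.
rng = np.random.default_rng(0)
uk_vecs = rng.normal(size=(20, 5))
true_map = rng.normal(size=(5, 5))
ru_vecs = uk_vecs @ true_map

uk_emb = {f"uk{i}": uk_vecs[i] for i in range(20)}
ru_emb = {f"ru{i}": ru_vecs[i] for i in range(20)}

W = learn_translation_matrix(uk_vecs, ru_vecs)

# A Ukrainian document and its word-for-word Russian counterpart.
fp_uk = fingerprint(["uk0", "uk3", "uk7"], uk_emb, W)
fp_ru = fingerprint(["ru0", "ru3", "ru7"], ru_emb)
cosine = float(fp_uk @ fp_ru)  # both fingerprints are unit length
```

In this idealised setting the mapped Ukrainian fingerprint coincides with the Russian one, so the two documents would land in the same cluster. With real embeddings the mapping is only approximate, and the resulting fingerprints would be passed to whichever clustering algorithm is chosen.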