Polish phonology and morphology through the lens of distributional semantics
This research addresses the relationship between linguistic form and meaning, a question of interest to both linguists and computational linguists. Its contribution is incremental in that it applies existing methods to a specific language.
This study uses distributional semantics to investigate whether the phonological and morphological structure of Polish words, particularly those with consonant clusters, is reflected in their semantic representations. It demonstrates that semantic vectors can predict phonotactic complexity, morphotactic transparency, and various morphosyntactic categories without any form information, and that computational models based on these embeddings achieve high accuracy in modelling comprehension and production.
This study investigates the relationship between the phonological and morphological structure of Polish words and their meanings using distributional semantics. We ask whether there is a relationship between the form properties of words containing consonant clusters and their meanings: is the phonological and morphonological structure of complex words mirrored in semantic space? We address these questions for Polish, a language characterized by non-trivial morphology and an impressive inventory of morphologically motivated consonant clusters. Using statistical and computational techniques such as t-SNE, Linear Discriminant Analysis and Linear Discriminative Learning, we demonstrate that semantic vectors, apart from encoding rich morphosyntactic information (e.g. tense, number, case), also capture information on sub-lexical linguistic units such as phoneme strings. First, phonotactic complexity, morphotactic transparency, and a wide range of morphosyntactic categories available in Polish (case, gender, aspect, tense, number) can be predicted from embeddings without any information about the forms of words. Second, we argue that computational modelling with the discriminative lexicon model using embeddings can provide highly accurate predictions for comprehension and production, precisely because semantic space contains extensive information that is, to a considerable extent, isomorphic with structure in form space.
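The idea of linear mappings between form space and semantic space can be made concrete with a minimal sketch. It assumes one common formulation of Linear Discriminative Learning, in which a comprehension mapping F and a production mapping G are estimated in closed form with the Moore-Penrose pseudoinverse. Everything specific here is an assumption made for illustration, not the setup of the study: the six-word list, the use of letter trigrams in place of phoneme-based cues, the random vectors standing in for real embeddings, and the helper names `cue_matrix` and `accuracy`.

```python
import numpy as np

def cue_matrix(words, n=3):
    """Build a binary word-by-cue matrix from letter n-grams (hypothetical helper).

    Words are padded with '#' so that word-initial and word-final
    n-grams are distinct cues.
    """
    padded = ["#" + w + "#" for w in words]
    cues = sorted({p[i:i + n] for p in padded for i in range(len(p) - n + 1)})
    index = {c: j for j, c in enumerate(cues)}
    C = np.zeros((len(words), len(cues)))
    for row, p in enumerate(padded):
        for i in range(len(p) - n + 1):
            C[row, index[p[i:i + n]]] = 1.0
    return C, cues

def accuracy(predicted, gold):
    """Share of rows whose prediction correlates best with its own gold row."""
    k = len(predicted)
    corr = np.corrcoef(predicted, gold)[:k, k:]  # predicted-by-gold block
    return float(np.mean(corr.argmax(axis=1) == np.arange(k)))

# Toy data: a few inflected forms (assumption, not the study's dataset).
words = ["kot", "kota", "koty", "pies", "psa", "psy"]

# Random vectors stand in for real embeddings (assumption).
rng = np.random.default_rng(0)
S = rng.normal(size=(len(words), 8))

C, cues = cue_matrix(words)

# Comprehension: map form cues to semantic vectors, C @ F ~ S.
F = np.linalg.pinv(C) @ S
# Production: map semantic vectors to form cues, S @ G ~ C.
G = np.linalg.pinv(S) @ C

print("comprehension accuracy:", accuracy(C @ F, S))
print("production accuracy:", accuracy(S @ G, C))
```

On a toy set like this, both mappings reproduce the training items trivially, because every word has a unique trigram and there are fewer words than dimensions. The sketch therefore shows only the mechanics of the two mappings and says nothing about the accuracies reported for Polish.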