Towards Automatic Construction of Filipino WordNet: Word Sense Induction and Synset Induction Using Sentence Embeddings
This work addresses the costly and slow process of creating or updating wordnets, particularly for low-resource languages, though it is incremental in that it builds on existing embedding techniques.
The study tackles automatic wordnet construction for low-resource languages such as Filipino. It proposes a method for word sense induction and synset induction that uses only an unlabeled corpus and sentence embeddings. Of the induced word senses, 30% were valid; of the induced synsets, 40% were valid, with 20% being novel.
Wordnets are indispensable tools for various natural language processing applications. Unfortunately, wordnets become outdated, and producing or updating them can be slow and costly in terms of time and resources. This problem intensifies for low-resource languages. This study proposes a method for word sense induction and synset induction using only two linguistic resources, namely, an unlabeled corpus and a sentence-embedding-based language model. The resulting sense inventory and synonym sets can be used to create a wordnet automatically. We applied this method to a corpus of Filipino text. The sense inventory and synsets were evaluated by matching them against the sense inventory of the machine-translated Princeton WordNet, and by comparing the synsets to the Filipino WordNet. This study empirically shows that 30% of the induced word senses are valid and 40% of the induced synsets are valid, with 20% being novel synsets.
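To make the two induction steps concrete, the following is a minimal sketch, not the procedure used in the study. It assumes that each occurrence of a word has already been mapped to a context vector by some sentence embedding model, and it substitutes a simple greedy cosine-threshold clustering for whatever grouping method the study actually uses, which this abstract does not specify. The function names, the thresholds, the placeholder words `word_a` and `word_b`, and the hand-made 2-D vectors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]
SenseKey = Tuple[str, int]  # (word, index of the induced sense)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def induce_senses(contexts: List[Vector], threshold: float = 0.8) -> List[List[Vector]]:
    """Word sense induction (assumed greedy variant): a context joins the
    first cluster whose centroid is similar enough, else it opens a new one.
    Each resulting cluster stands for one induced sense of the word."""
    clusters: List[List[Vector]] = []
    for vec in contexts:
        for cluster in clusters:
            if cosine(vec, centroid(cluster)) >= threshold:
                cluster.append(vec)
                break
        else:
            clusters.append([vec])
    return clusters


def induce_synsets(senses: Dict[SenseKey, Vector], threshold: float = 0.9) -> List[Set[SenseKey]]:
    """Synset induction (assumed greedy variant): a sense joins the first
    synset whose founding sense centroid is similar enough."""
    founders: List[Vector] = []
    synsets: List[Set[SenseKey]] = []
    for key, vec in senses.items():
        for i, founder in enumerate(founders):
            if cosine(vec, founder) >= threshold:
                synsets[i].add(key)
                break
        else:
            founders.append(vec)
            synsets.append({key})
    return synsets


# Hand-made stand-ins for sentence embeddings of each word occurrence.
corpus_contexts: Dict[str, List[Vector]] = {
    "word_a": [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    "word_b": [[0.95, 0.05], [1.0, 0.05]],
}

sense_centroids: Dict[SenseKey, Vector] = {}
for word, contexts in corpus_contexts.items():
    for sense_id, cluster in enumerate(induce_senses(contexts)):
        sense_centroids[(word, sense_id)] = centroid(cluster)

synsets = induce_synsets(sense_centroids)
print(synsets)
```

In this toy setup `word_a` splits into two induced senses, and its first sense is grouped with the single sense of `word_b`. With real data, the thresholds would have to be tuned, and nothing in this sketch prevents two senses of the same word from landing in one synset.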