comp-syn: Perceptually Grounded Word Embeddings with Color
This work addresses a limitation of purely textual word embeddings in natural language processing by adding perceptual grounding. It offers an incremental improvement together with practical tools for researchers and developers.
The paper tackles the problem that word embeddings ignore sensory aspects of language. It introduces comp-syn, a Python package that provides grounded word embeddings based on the color distributions of Google Image search results. The authors show that comp-syn predicts human judgments of word concreteness with greater accuracy than word2vec and performs comparably to it on a metaphorical vs. literal word-pair classification task.
Popular approaches to natural language processing create word embeddings based on textual co-occurrence patterns, but often ignore embodied, sensory aspects of language. Here, we introduce the Python package comp-syn, which provides grounded word embeddings based on the perceptually uniform color distributions of Google Image search results. We demonstrate that comp-syn significantly enriches models of distributional semantics. In particular, we show that (1) comp-syn predicts human judgments of word concreteness with greater accuracy and in a more interpretable fashion than word2vec using low-dimensional word-color embeddings, and (2) comp-syn performs comparably to word2vec on a metaphorical vs. literal word-pair classification task. comp-syn is open-source on PyPI and is compatible with mainstream machine-learning Python packages. Our package release includes word-color embeddings for over 40,000 English words, each associated with crowd-sourced word concreteness judgments.
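To make the idea of a low-dimensional word-color embedding concrete, the following is a minimal sketch of how a color distribution could be aggregated over a set of images for one word. It is not the comp-syn API: the function names `srgb_to_lab` and `color_embedding` are hypothetical, CIELAB is assumed here as a stand-in for a perceptually uniform color space, the 2x2x2 binning is an arbitrary choice, and the solid-color arrays stand in for image search results.

```python
import numpy as np

# Standard sRGB (D65) to CIE XYZ matrix and D65 reference white.
_RGB_TO_XYZ = np.array([
    [0.4124564, 0.3575761, 0.1804375],
    [0.2126729, 0.7151522, 0.0721750],
    [0.0193339, 0.1191920, 0.9503041],
])
_WHITE_D65 = np.array([0.95047, 1.0, 1.08883])

# Histogram ranges for the (L, a, b) channels (assumed bounds).
_LAB_RANGES = [(0.0, 100.0), (-128.0, 128.0), (-128.0, 128.0)]


def srgb_to_lab(rgb):
    """Convert an array of 8-bit sRGB values with shape (..., 3) to CIELAB."""
    c = np.asarray(rgb, dtype=np.float64) / 255.0
    # Undo the sRGB gamma curve to get linear RGB.
    linear = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> XYZ, normalized by the reference white.
    xyz = linear @ _RGB_TO_XYZ.T / _WHITE_D65
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)


def color_embedding(images, bins=(2, 2, 2)):
    """Average the normalized Lab histograms of several images (hypothetical helper)."""
    lo = np.array([r[0] for r in _LAB_RANGES])
    hi = np.array([r[1] for r in _LAB_RANGES])
    total = np.zeros(bins)
    for img in images:
        lab = srgb_to_lab(img).reshape(-1, 3)
        # Clip so rounding error cannot push a pixel outside the histogram range.
        lab = np.clip(lab, lo, hi)
        hist, _ = np.histogramdd(lab, bins=bins, range=_LAB_RANGES)
        total += hist / hist.sum()
    # Flatten to a vector whose entries sum to 1.
    return (total / len(images)).ravel()


# Synthetic stand-ins for the images retrieved for two words.
red = np.full((4, 4, 3), (255, 0, 0), dtype=np.uint8)
blue = np.full((4, 4, 3), (0, 0, 255), dtype=np.uint8)
print(color_embedding([red]))
print(color_embedding([blue]))
```

Because each embedding is a probability distribution over a handful of color bins, every coordinate has a direct reading (the share of pixels in that region of color space), which is one way to understand the interpretability claim made above.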