CL, LG · Jan 14, 2020

Balancing the composition of word embeddings across heterogenous data sets

Stephanie Brandl, David Lassner, Maximilian Alber

arXiv:2001.04693v1

Originality: Incremental advance

AI Analysis

This paper addresses bias in word embeddings for NLP applications, but the advance is incremental because it builds on existing embedding methods.

The paper tackles the problem of word embeddings that are biased by heterogeneous data subsets. It proposes a criterion to measure the influence of each subset and develops approaches to balance that influence. A weighted average of the subset embeddings balances the influence of the subsets but reduces word similarity performance, and the authors suggest an optimization method intended to maintain both balance and quality.

Word embeddings capture semantic relationships based on contextual information and are the basis for a wide variety of natural language processing applications. Notably, these relationships are learned solely from the data, so the data composition affects the semantics of the embeddings, which arguably can lead to biased word vectors. Given qualitatively different data subsets, we aim to align the influence of single subsets on the resulting word vectors while retaining their quality. To this end, we propose a criterion to measure the shift towards a single data subset and develop approaches to meet both objectives. We find that a weighted average of the two subset embeddings balances the influence of those subsets, while word similarity performance decreases. We further propose a promising optimization approach to balance the influence and quality of word embeddings.
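
As a rough illustration of the weighted-average idea in the abstract, the sketch below combines two subset embeddings and checks which subset the result leans towards. It is a minimal sketch under stated assumptions, not the authors' method: the abstract does not define the shift criterion, so `shift_towards_a` is a hypothetical proxy based on cosine similarity, the toy vectors are invented, and both subset embeddings are assumed to already live in the same vector space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_average(vec_a, vec_b, alpha):
    """Combine two subset vectors; alpha is the weight given to subset A."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(vec_a, vec_b)]


def shift_towards_a(combined, vec_a, vec_b):
    """Hypothetical proxy for the shift criterion (an assumption, not the
    paper's definition): positive when the combined vector is closer to
    subset A's vector than to subset B's, negative in the opposite case."""
    return cosine(combined, vec_a) - cosine(combined, vec_b)


# Toy embeddings for two data subsets (invented values, shared space assumed).
emb_a = {"bank": [1.0, 0.0], "river": [0.0, 1.0]}
emb_b = {"bank": [0.0, 1.0], "river": [1.0, 0.0]}


def mean_shift(alpha):
    """Average the proxy shift over the shared vocabulary for a given alpha."""
    shifts = []
    for word in emb_a:
        combined = weighted_average(emb_a[word], emb_b[word], alpha)
        shifts.append(shift_towards_a(combined, emb_a[word], emb_b[word]))
    return sum(shifts) / len(shifts)
```

In this toy setup, scanning `alpha` over a grid and picking the value whose `mean_shift` is closest to zero is one simple way to look for a balanced mixture. Whether the balanced embedding keeps its word similarity performance would have to be evaluated separately, which is the trade-off the abstract reports.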
