What do you mean, BERT? Assessing BERT as a Distributional Semantics Model
This work identifies limitations in BERT's semantic modeling for NLP researchers, highlighting issues that could impact downstream tasks, but it is incremental as it builds on existing critiques of contextual embeddings.
The paper assessed BERT's contextualized embeddings as a distributional semantics model, finding that while they show some semantic coherence, they are disrupted by meaningless positional effects from sentence structure, affecting similarity relationships.
Contextualized word embeddings, i.e. vector representations for words in context, are naturally seen as an extension of previous noncontextual distributional semantic models. In this work, we focus on BERT, a deep neural network that produces contextualized embeddings and has set the state-of-the-art in several semantic tasks, and study the semantic coherence of its embedding space. While showing a tendency towards coherence, BERT does not fully live up to the natural expectations for a semantic vector space. In particular, we find that the position of the sentence in which a word occurs, while having no meaning correlates, leaves a noticeable trace on the word embeddings and disturbs similarity relationships.