CLApr 5, 2021

WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach

arXiv:2104.01767v3684 citations
AI Analysis

This work provides practical improvements for unsupervised sentence embeddings in natural language matching and retrieval, though it is incremental in nature.

The paper tackled unsupervised sentence embedding by analyzing pretrained models and found that averaging all tokens, combining top and bottom layers, and applying a simple whitening-based normalization strategy consistently improves performance across seven datasets.

Producing the embedding of a sentence in an unsupervised way is valuable to natural language matching and retrieval problems in practice. In this work, we conduct a thorough examination of pretrained model based unsupervised sentence embeddings. We study on four pretrained models and conduct massive experiments on seven datasets regarding sentence semantics. We have there main findings. First, averaging all tokens is better than only using [CLS] vector. Second, combining both top andbottom layers is better than only using top layers. Lastly, an easy whitening-based vector normalization strategy with less than 10 lines of code consistently boosts the performance.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes