WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach
This work provides practical improvements for unsupervised sentence embeddings in natural language matching and retrieval, though it is incremental in nature.
The paper tackled unsupervised sentence embedding by analyzing pretrained models and found that averaging all tokens, combining top and bottom layers, and applying a simple whitening-based normalization strategy consistently improves performance across seven datasets.
Producing the embedding of a sentence in an unsupervised way is valuable to natural language matching and retrieval problems in practice. In this work, we conduct a thorough examination of pretrained model based unsupervised sentence embeddings. We study on four pretrained models and conduct massive experiments on seven datasets regarding sentence semantics. We have there main findings. First, averaging all tokens is better than only using [CLS] vector. Second, combining both top andbottom layers is better than only using top layers. Lastly, an easy whitening-based vector normalization strategy with less than 10 lines of code consistently boosts the performance.