Improving Contrastive Learning of Sentence Embeddings with Case-Augmented Positives and Retrieved Negatives
This work addresses the performance gap between unsupervised and supervised sentence embedding methods in natural language processing, as an incremental improvement built on SimCSE.
The paper tackles unsupervised contrastive learning for sentence embeddings by improving the quality of both positive and negative samples, yielding a method that significantly surpasses the prior unsupervised state of the art on STS benchmarks.
Following SimCSE, contrastive-learning-based methods have achieved state-of-the-art (SOTA) performance in learning sentence embeddings. However, unsupervised contrastive learning methods still lag far behind their supervised counterparts. We attribute this gap to the quality of positive and negative samples, and aim to improve both. Specifically, for positive samples, we propose switch-case augmentation, which flips the case of the first letter of randomly selected words in a sentence. This counteracts the intrinsic bias of pre-trained token embeddings toward frequency, word case, and subwords. For negative samples, we sample hard negatives from the whole dataset using a pre-trained language model. Combining these two methods with SimCSE, our proposed Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS) method significantly surpasses the current SOTA on semantic textual similarity (STS) benchmarks in the unsupervised setting.
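To make the two sampling ideas concrete, the following is a minimal sketch rather than the authors' implementation. The function names `switch_case` and `retrieve_hard_negatives`, the flip probability `p`, and the cutoff `k` are hypothetical choices made for illustration. The embeddings are assumed to come from some pre-trained language model, and toy vectors stand in for them here. Taking the top-k nearest neighbours by cosine similarity is a deterministic stand-in for the hard-negative sampling described above.

```python
import math
import random


def switch_case(sentence, p=0.15, rng=None):
    """Flip the case of the first letter of each word with probability p.

    p is an illustrative value, not one taken from the paper.
    """
    rng = rng or random.Random()
    augmented = []
    for word in sentence.split(" "):
        # Only touch words that start with a letter, so digits and
        # punctuation are left unchanged.
        if word and word[0].isalpha() and rng.random() < p:
            word = word[0].swapcase() + word[1:]
        augmented.append(word)
    return " ".join(augmented)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_hard_negatives(embeddings, index, k=2):
    """Return indices of the k sentences most similar to sentence `index`.

    The sentence itself is excluded. The embeddings are assumed to be
    precomputed by a pre-trained language model.
    """
    scored = [
        (cosine(embeddings[index], emb), j)
        for j, emb in enumerate(embeddings)
        if j != index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in scored[:k]]


if __name__ == "__main__":
    # With p=1.0 every word is flipped, which makes the effect easy to see.
    print(switch_case("the quick Brown fox", p=1.0))
    # Toy 2-d vectors standing in for real sentence embeddings.
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(retrieve_hard_negatives(toy_embeddings, index=0, k=2))
```

In a SimCSE-style setup, the augmented sentence would serve as an additional positive view of the original, and the retrieved indices would supply extra negatives for the contrastive loss. How the two are weighted and combined is not specified in the text above and is left open here.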