CL, AI | Sep 14, 2023

DebCSE: Rethinking Unsupervised Contrastive Sentence Embedding Learning in the Debiasing Perspective

arXiv:2309.07396v1 | 11 citations | h-index: 11

Originality: Incremental advance

AI Analysis

The paper addresses the problem of learning fine-grained semantics in sentence embeddings for NLP applications. The advance is incremental, since it builds on existing contrastive methods such as SimCSE and ConSERT.

The paper tackles biases in unsupervised contrastive sentence embedding learning, such as sentence length bias and false negative sample bias. It proposes DebCSE, which uses inverse propensity weighted sampling to select high-quality pairs, and reports an average Spearman's correlation of 80.33% on STS benchmarks with BERTbase.
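For readers unfamiliar with the metric, the following is a minimal sketch of how a Spearman's correlation on an STS-style benchmark is commonly computed: the cosine similarity of each sentence pair's embeddings is rank-correlated with human similarity scores. This is not the paper's evaluation code. The function names (`cosine`, `average_ranks`, `spearman`) and the toy vectors and gold scores are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (invented): embeddings for three sentence pairs and gold scores.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),   # near-paraphrase
    ([1.0, 0.0], [0.5, 0.5]),   # loosely related
    ([1.0, 0.0], [0.0, 1.0]),   # unrelated
]
gold_scores = [4.8, 2.5, 0.2]
predicted = [cosine(u, v) for u, v in pairs]
rho = spearman(predicted, gold_scores)
```

Because the metric depends only on rank order, it rewards embeddings that order pairs the same way annotators do, regardless of the absolute similarity values.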

Several prior studies have suggested that word frequency biases can cause the BERT model to learn indistinguishable sentence embeddings. Contrastive learning schemes such as SimCSE and ConSERT have already been adopted successfully in unsupervised sentence embedding to improve the quality of embeddings by reducing this bias. However, these methods still introduce new biases, such as sentence length bias and false negative sample bias, which hinder the model's ability to learn more fine-grained semantics. In this paper, we reexamine the challenges of contrastive sentence embedding learning from a debiasing perspective and argue that effectively eliminating the influence of various biases is crucial for learning high-quality sentence embeddings. We argue that all of these biases are introduced by the simple rules used to construct training data in contrastive learning, and that the key to contrastive sentence embedding learning is to mimic, in an unsupervised way, the distribution of training data in supervised machine learning. We propose a novel contrastive framework for sentence embedding, termed DebCSE, which can eliminate the impact of these biases by using an inverse propensity weighted sampling method to select high-quality positive and negative pairs according to both the surface and semantic similarity between sentences. Extensive experiments on semantic textual similarity (STS) benchmarks reveal that DebCSE significantly outperforms the latest state-of-the-art models, with an average Spearman's correlation coefficient of 80.33% on BERTbase.
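The abstract does not spell out how the propensity scores are defined, so the following is only a rough sketch of the general idea of inverse propensity weighted sampling, under stated assumptions. Here the propensity of a candidate negative is a hypothetical blend of surface similarity (token-level Jaccard overlap) and a supplied semantic similarity score, and candidates are drawn with probability proportional to the inverse of that propensity. The names `jaccard`, `inverse_propensity_weights`, `sample_negatives`, and the mixing parameter `alpha` are illustrative and do not come from the paper.

```python
import random


def jaccard(a, b):
    """Surface similarity: token-set overlap between two sentences."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def inverse_propensity_weights(propensities, eps=1e-6):
    """Normalised weights proportional to 1 / propensity."""
    raw = [1.0 / max(p, eps) for p in propensities]  # eps avoids division by zero
    total = sum(raw)
    return [w / total for w in raw]


def sample_negatives(anchor, candidates, semantic_sims, k, alpha=0.5, rng=None):
    """Draw k negatives (with replacement) using inverse propensity weights.

    ASSUMPTION: propensity = alpha * surface similarity
                           + (1 - alpha) * semantic similarity.
    This blend is a stand-in, not the paper's definition.
    """
    rng = rng or random.Random()
    propensities = [
        alpha * jaccard(anchor, cand) + (1 - alpha) * sim
        for cand, sim in zip(candidates, semantic_sims)
    ]
    weights = inverse_propensity_weights(propensities)
    return rng.choices(candidates, weights=weights, k=k), weights


anchor = "a man is playing a guitar"
candidates = [
    "a man is playing a guitar loudly",   # likely a false negative
    "the stock market fell today",        # clearly unrelated
]
semantic_sims = [0.9, 0.1]  # invented scores, e.g. from some encoder
picked, weights = sample_negatives(
    anchor, candidates, semantic_sims, k=5, rng=random.Random(0)
)
```

Under this assumed propensity, a candidate that closely resembles the anchor receives a small sampling weight, which illustrates how inverse weighting can make likely false negatives less frequent among the sampled negatives.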
