CL · Oct 8, 2022

SDA: Simple Discrete Augmentation for Contrastive Sentence Representation Learning

arXiv:2210.03963v3 · 83 citations · h-index: 12 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for effective discrete data augmentation methods in contrastive sentence representation learning, offering incremental improvements over existing approaches like SimCSE.

The paper tackles data augmentation in unsupervised sentence representation learning by proposing three simple discrete augmentation schemes (punctuation insertion, modal verbs, and double negation) that balance semantic consistency and expression diversity. It reports consistent gains on semantic textual similarity across diverse datasets.

Contrastive learning has recently achieved compelling performance in unsupervised sentence representation. Data augmentation protocols, although an essential element, have not been well explored. The pioneering work SimCSE, which resorts to a simple dropout mechanism (viewed as continuous augmentation), is reported to surprisingly outperform discrete augmentations such as cropping, word deletion, and synonym replacement. To understand the underlying rationale, we revisit existing approaches and hypothesize the desiderata of a reasonable data augmentation method: a balance of semantic consistency and expression diversity. We then develop three simple yet effective discrete sentence augmentation schemes: punctuation insertion, modal verbs, and double negation. They act as minimal noise at the lexical level to produce diverse forms of sentences. Furthermore, standard negation is used to generate negative samples that alleviate the feature suppression involved in contrastive learning. We experiment extensively with semantic textual similarity on diverse datasets. The results consistently support the superiority of the proposed methods. Our key code is available at https://github.com/Zhudongsheng75/SDA
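To make the idea of the three augmentation schemes and the negation-based negatives concrete, the following is a minimal toy sketch and not the authors' implementation (see the linked repository for that). The punctuation set, the copula-only rewriting rule, and every function name here (`insert_punctuation`, `add_modal_verb`, `double_negate`, `negate`, `build_triplet`) are assumptions made for illustration; the abstract does not specify how each scheme is applied.

```python
import random

# Assumed candidate marks; the abstract does not list the punctuation used.
PUNCTUATION = [",", ".", ";", ":", "!", "?"]
# Hypothetical toy rule: only the copulas "is"/"are" are rewritten.
COPULAS = {"is", "are"}


def insert_punctuation(sentence: str, rng: random.Random) -> str:
    """Positive view: insert one punctuation mark at a random token boundary."""
    tokens = sentence.split()
    if not tokens:
        return sentence
    pos = rng.randint(0, len(tokens))  # randint is inclusive on both ends
    tokens.insert(pos, rng.choice(PUNCTUATION))
    return " ".join(tokens)


def _rewrite_first_copula(sentence: str, make_replacement) -> str:
    """Replace the first copula via make_replacement(copula); no-op if none."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in COPULAS:
            tokens[i] = make_replacement(tok)
            return " ".join(tokens)
    return sentence


def add_modal_verb(sentence: str) -> str:
    """Positive view (toy): 'is good' -> 'must be good'."""
    return _rewrite_first_copula(sentence, lambda c: "must be")


def double_negate(sentence: str) -> str:
    """Positive view (toy): two negations, meant to keep the meaning."""
    return _rewrite_first_copula(sentence, lambda c: f"{c} not not")


def negate(sentence: str) -> str:
    """Hard negative (toy): a single negation flips the meaning."""
    return _rewrite_first_copula(sentence, lambda c: f"{c} not")


def build_triplet(sentence: str, rng: random.Random):
    """Return (anchor, positive, hard_negative) for contrastive training."""
    views = [
        lambda s: insert_punctuation(s, rng),
        add_modal_verb,
        double_negate,
    ]
    positive = rng.choice(views)(sentence)
    return sentence, positive, negate(sentence)
```

Calling `build_triplet("The movie is good", random.Random(0))` yields the original sentence as the anchor, one randomly chosen lexical variant as the positive, and "The movie is not good" as the hard negative. In a real pipeline the three strings would be encoded and fed to a contrastive loss; that step is omitted here because the abstract does not describe it.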
