CL LG · Aug 5, 2025

Cropping outperforms dropout as an augmentation strategy for training self-supervised text embeddings

Rita González-Márquez, Philipp Berens, Dmitry Kobak

arXiv:2508.03453v1 · h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient self-supervised methods in NLP that reduce reliance on curated data. It is an incremental advance, since it builds on existing augmentation strategies from computer vision.

The paper addresses the problem of generating high-quality text embeddings by comparing cropping and dropout as augmentation strategies in self-supervised contrastive learning. It finds that cropping strongly outperforms dropout and that, on in-domain data, the resulting embeddings are competitive with supervised state-of-the-art models after short fine-tuning.

Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, sentiment analysis, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via extensive supervised fine-tuning using curated text pairs. This contrasts with computer vision, where self-supervised training based on data augmentations has demonstrated remarkable success. Here we systematically compare the two most well-known augmentation strategies for positive pair generation in contrastive learning of text embeddings. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of resulting embeddings is below the supervised SOTA models, but for in-domain data, self-supervised fine-tuning produces high-quality text embeddings after very short fine-tuning, sometimes only marginally below the supervised SOTA. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning; and that fine-tuning only those last layers is sufficient to reach similar embedding quality.
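To make the cropping idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a text is already split into tokens, that a positive pair consists of two independently sampled contiguous crops of the same text, and that training minimizes an InfoNCE-style contrastive loss over cosine similarities. The function names `crop_pair` and `info_nce`, the crop lengths, and the temperature are hypothetical choices for illustration; the abstract does not specify them. A dropout-based variant would instead pass the same uncropped text through the model twice with different dropout masks.

```python
import math
import random


def crop_pair(tokens, min_len, max_len, rng):
    """Return two independently sampled contiguous crops of one text.

    The two crops form a positive pair. Crop unit (tokens) and the
    length bounds are illustrative assumptions.
    """
    if len(tokens) < min_len:
        raise ValueError("text is shorter than min_len")

    def one_crop():
        length = rng.randint(min_len, min(max_len, len(tokens)))
        start = rng.randint(0, len(tokens) - length)
        return tokens[start:start + length]

    return one_crop(), one_crop()


def info_nce(anchors, positives, temperature=0.5):
    """Mean InfoNCE loss over a batch of embedding vectors.

    anchors[i] and positives[i] embed the two crops of text i; all other
    positives[j] in the batch serve as negatives for anchors[i].
    The temperature value is an assumed default.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        log_norm = math.log(sum(math.exp(x) for x in logits))
        total += log_norm - logits[i]  # negative log-softmax of the positive
    return total / len(anchors)
```

In use, each text in a batch would be cropped twice, both crops embedded by the model being fine-tuned, and the loss computed on the two resulting lists of embeddings. The loss is low when each crop's embedding is closest to the embedding of the other crop from the same text.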

