CL IR | Jun 14, 2023

Generate to Understand for Representation

Changshang Xue, Xiande Zhong, Xiaoqing Liu

arXiv:2306.10056v1 | 0.5 | h-index: 1 | Has Code

Originality: Highly original

AI Analysis

This work addresses the problem of resource-intensive model training for NLP practitioners, offering a more efficient alternative to scaling-based approaches.

The paper tackles the high costs of pretraining and finetuning language models by introducing GUR, a framework that combines masked language modeling and contrastive learning in a single training step using unlabeled data. As a zero-shot retriever, GUR outperforms the other pretrained baselines on the recall benchmark.
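To make the idea of a single combined training step concrete, here is a minimal sketch in plain Python. It is not taken from the GUR code: the function names, the in-batch InfoNCE form of the contrastive term, the `temperature` value, and the `weight` that balances the two terms are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchors, positives, temperature=0.05):
    """In-batch contrastive loss: anchors[i] should match positives[i].

    Every other positive in the batch acts as a negative for anchor i.
    (Assumed form; the paper may use a different contrastive objective.)
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def mlm_loss(masked_token_log_probs):
    """Mean negative log-likelihood of the true tokens at masked positions."""
    return -sum(masked_token_log_probs) / len(masked_token_log_probs)


def joint_loss(masked_token_log_probs, anchors, positives, weight=1.0):
    """Sum of both objectives, so one backward pass would train both."""
    return mlm_loss(masked_token_log_probs) + weight * info_nce(anchors, positives)
```

In a real training loop the log-probabilities and embeddings would come from the same forward pass of one encoder, which is what lets both objectives share a single step.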

In recent years, many high-quality pretrained models have emerged, greatly impacting Natural Language Understanding (NLU), Natural Language Generation (NLG), and text representation tasks. Traditionally, these models are pretrained on custom domain corpora and finetuned for specific tasks, which incurs high GPU and labor costs. Recent trends in language modeling have shifted towards improving performance through scaling, which raises these costs further. We introduce GUR, a pretraining framework that combines language modeling and contrastive learning objectives in a single training step. We select similar text pairs from raw unlabeled documents based on their Longest Common Substring (LCS) and train the model with masked language modeling and unsupervised contrastive learning. The resulting model, GUR, requires no labeled training data and outperforms all other pretrained baselines as a retriever on the recall benchmark in a zero-shot setting. GUR also retains its language modeling ability, as demonstrated in our ablation experiment. Our code is available at https://github.com/laohur/GUR.
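The LCS-based pair selection described above can be sketched with the Python standard library. This is an illustrative sketch, not the authors' implementation: the helper names, the normalization by the shorter text's length, and the `min_ratio` threshold are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def select_pairs(texts, min_ratio=0.3):
    """Pick text pairs whose LCS covers enough of the shorter text.

    Returns (i, j, score) tuples sorted by descending score, where
    score = len(LCS) / len(shorter text). The threshold is hypothetical.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        if not a or not b:
            continue  # skip empty texts to avoid division by zero
        common = longest_common_substring(a, b)
        score = len(common) / min(len(a), len(b))
        if score >= min_ratio:
            pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

The selected pairs would then serve as positives for the contrastive objective. Comparing all pairs is quadratic in the number of texts, so a large corpus would need some candidate filtering first.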
