CL · Apr 5, 2020

Semantics of the Unwritten: The Effect of End of Paragraph and Sequence Tokens on Text Generation with GPT2

He Bai, Peng Shi, Jimmy Lin, Luchen Tan, Kun Xiong, Wen Gao, Jie Liu, Ming Li

arXiv:2004.02251v2 · 27.67 · 16 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

The paper addresses a specific text generation problem of interest to NLP researchers, but the contribution is incremental: it builds on existing GPT2 models with minor modifications.

The study investigated how end-of-paragraph and end-of-sequence tokens affect text generation quality. It found that fine-tuning GPT2 to generate end-of-paragraph tokens improves continuation quality: experiments on English story generation showed higher BLEU scores and lower end-of-sequence perplexity, and experiments with a Chinese GPT2 showed better essay endings.

The semantics of a text is manifested not only by what is read, but also by what is not read. In this article, we will study how the implicit "not read" information such as end-of-paragraph (EOP) and end-of-sequence (EOS) affect the quality of text generation. Specifically, we find that the pre-trained language model GPT2 can generate better continuations by learning to generate the EOP in the fine-tuning stage. Experimental results on English story generation show that EOP can lead to higher BLEU score and lower EOS perplexity. We also conduct experiments on a self-collected Chinese essay dataset with Chinese-GPT2, a character level LM without EOP or EOS during pre-training. Experimental results show that the Chinese GPT2 can generate better essay endings with EOP.
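The preprocessing idea behind the abstract can be illustrated with a minimal sketch: before fine-tuning, each paragraph of a training text is followed by an explicit end-of-paragraph marker, and the whole text by an end-of-sequence marker. This is not the authors' code. The marker string `<|eop|>`, the function name `add_boundary_tokens`, and the choice to split paragraphs on newlines are all assumptions made for illustration; `<|endoftext|>` is the end-of-text token used by GPT2.

```python
# Illustrative sketch only; names and the EOP marker string are assumptions.
EOP = "<|eop|>"        # hypothetical end-of-paragraph marker
EOS = "<|endoftext|>"  # GPT2's end-of-text token, used here as end-of-sequence


def add_boundary_tokens(text: str, eop: str = EOP, eos: str = EOS) -> str:
    """Append `eop` after every paragraph and `eos` after the whole text."""
    # Assumption: newlines separate paragraphs; blank lines are skipped.
    paragraphs = [p.strip() for p in text.splitlines() if p.strip()]
    return "".join(f"{p} {eop} " for p in paragraphs) + eos


if __name__ == "__main__":
    story = "Once upon a time.\n\nThe end."
    print(add_boundary_tokens(story))
    # prints: Once upon a time. <|eop|> The end. <|eop|> <|endoftext|>
```

With a real tokenizer, a new marker such as `<|eop|>` would typically need to be registered as a special token, and the model's embedding matrix resized to match, before fine-tuning; those steps are omitted here.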

