CL | Nov 10, 2023

Heaps' Law in GPT-Neo Large Language Model Emulated Corpora

Uyen Lai, Gurjit S. Randhawa, Paul Sheridan

arXiv:2311.06377v1 | 0.92 citations | h-index: 7 | Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of understanding the statistical properties of AI-generated text, which is of interest to researchers in computational linguistics and natural language processing. The contribution is incremental: it extends a known empirical law to a new kind of data.

The study investigated whether Heaps' law, which predicts vocabulary growth in text corpora, applies to text generated by GPT-Neo large language models, finding that the generated corpora adhere to the law and that larger models produce vocabulary growth more similar to human-authored text.

Heaps' law is an empirical relation in text analysis that predicts vocabulary growth as a function of corpus size. While this law has been validated in diverse human-authored text corpora, its applicability to large language model generated text remains unexplored. This study addresses this gap, focusing on the emulation of corpora using the suite of GPT-Neo large language models. To conduct our investigation, we emulated corpora of PubMed abstracts using three different parameter sizes of the GPT-Neo model. Our emulation strategy involved using the initial five words of each PubMed abstract as a prompt and instructing the model to expand the content up to the original abstract's length. Our findings indicate that the generated corpora adhere to Heaps' law. Interestingly, as the GPT-Neo model size grows, its generated vocabulary increasingly adheres to Heaps' law as observed in human-authored text. To further improve the richness and authenticity of GPT-Neo outputs, future iterations could emphasize increasing model size or refining the model architecture to curtail vocabulary repetition.
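
As an illustration of the relation the abstract refers to, Heaps' law is commonly written as V(n) = K * n^beta, where V(n) is the number of distinct words among the first n tokens of a corpus. The sketch below is not the authors' code. It assumes tokens are already split (for example by lowercasing and splitting on whitespace, which is an assumption here, not the paper's stated preprocessing) and estimates K and beta by ordinary least squares on the log-log curve. The function names `vocabulary_growth` and `fit_heaps` are hypothetical.

```python
import math


def vocabulary_growth(tokens):
    """Return [(n, V(n))], where V(n) counts distinct tokens among the first n."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve):
    """Fit log V = log K + beta * log n by least squares; return (K, beta).

    Assumes the curve has at least two points with different n.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


if __name__ == "__main__":
    # Toy input only; a real corpus would be the concatenated abstracts.
    text = "the cat sat on the mat and the dog sat on the log"
    curve = vocabulary_growth(text.lower().split())
    K, beta = fit_heaps(curve)
    print(f"K = {K:.3f}, beta = {beta:.3f}")
```

Under this reading, comparing the fitted beta of a generated corpus with that of the human-authored corpus is one simple way to quantify how closely the vocabulary growth curves match. A lower beta would indicate more repeated vocabulary.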
