CL, Sep 30, 2025

MGen: Millions of Naturally Occurring Generics in Context

arXiv:2509.26160v1, 1 citation, h-index: 8
Originality: Synthesis-oriented
AI Analysis

This dataset enables large-scale computational research on genericity, addressing a data bottleneck for linguists and AI researchers working on natural language understanding.

The authors address the lack of large-scale data for studying generic sentences by creating MGen, a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. They describe it as the biggest and most diverse dataset of its kind.

MGen is a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. Each sentence comes with a long context document, drawn from websites and academic papers, and the dataset covers 11 different quantifiers. We analyze the features of the generic sentences in the dataset and report two findings: generics can be long sentences (averaging over 16 words), and speakers often use them to express generalisations about people. MGen is the biggest and most diverse dataset of naturally occurring generic sentences, opening the door to large-scale computational research on genericity. It is publicly available at https://gustavocilleruelo.com/mgen
