CL, Feb 16

Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

arXiv:2602.14819v1
h-index: 3
Originality: Synthesis-oriented
AI Analysis

The corpus is a valuable resource for NLP and sociolinguistics researchers working on Italian language modeling and digital communication analysis, though the contribution is incremental: it applies existing data collection methods to a new language-specific dataset.

The authors introduced a large-scale Italian discussion board corpus spanning from 1996 to 2024, containing over 30 billion word-tokens, to serve as a dataset for training native Italian large language models and for sociolinguistic research.

We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), makes it an ideal dataset for pre-training native Italian Large Language Models. Furthermore, discussion board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.

