FRENCH-YMCA: A FRENCH Corpus meeting the language needs of Youth, froM Children to Adolescents
This work addresses the problem of developing language models that cater to the unique language needs of youth. Its contribution is incremental, as it focuses on data collection rather than novel methods.
The authors tackled the lack of linguistic resources for children and adolescents by introducing the French-YMCA corpus, which contains 22,471,898 words from 39,200 text files to support age-appropriate language models.
In this paper, we introduce the French-YMCA corpus, a new linguistic resource specifically tailored for children and adolescents. The motivation for building this corpus is clear: children have unique language requirements, as their language skills are constantly evolving and differ from those of adults. The French-YMCA corpus comprises 39,200 text files totalling 22,471,898 words. It distinguishes itself through its diverse sources, its consistent grammar and spelling, and its commitment to open online access for all. Such a corpus can serve as the foundation for training language models that understand and anticipate young people's language, thereby enhancing the quality of digital interactions and ensuring that responses and suggestions are age-appropriate and adapted to the comprehension level of users in this age group.