Formalising lexical and syntactic diversity for data sampling in French
This work addresses challenges in creating datasets for French language processing. It is incremental: it builds on existing diversity concepts rather than introducing major breakthroughs.
The paper tackles the problem of efficiently sampling diverse data for French datasets. It proposes a heuristic that significantly increases diversity compared to random sampling, and it explores correlations between lexical and syntactic diversity, finding that these correlations vary across datasets and measures.
Diversity is an important property of datasets, and sampling data for diversity is useful in dataset creation. Finding the optimally diverse sample is expensive; we therefore present a heuristic that significantly increases diversity relative to random sampling. We also explore whether different kinds of diversity (lexical and syntactic) correlate, with the aim of sampling for expensive syntactic diversity through inexpensive lexical diversity. We find that correlations fluctuate across datasets and versions of diversity measures. This shows that an arbitrarily chosen measure may fall short of capturing diversity-related properties of datasets.
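To make the contrast between heuristic and random sampling concrete, here is a minimal sketch. It is not the heuristic from the paper, which the text above does not specify. It assumes that lexical diversity is measured as richness (the number of distinct tokens in the sample) and that a simple greedy strategy stands in for the heuristic. The names `richness` and `greedy_diverse_sample` and the toy corpus are hypothetical and serve only as illustration.

```python
import random


def richness(sentences):
    """Number of distinct tokens (types) in a sample of tokenised sentences.

    Assumption: richness stands in for lexical diversity; the paper may use
    other measures.
    """
    return len({tok for sent in sentences for tok in sent})


def greedy_diverse_sample(sentences, k):
    """Pick up to k sentences, each time adding the one with the most unseen types.

    Assumption: a generic greedy strategy, not the paper's heuristic.
    """
    remaining = list(range(len(sentences)))
    seen = set()
    chosen = []
    for _ in range(min(k, len(remaining))):
        # Marginal gain = number of token types not yet covered by the sample.
        best = max(remaining, key=lambda i: len(set(sentences[i]) - seen))
        chosen.append(best)
        seen.update(sentences[best])
        remaining.remove(best)
    return [sentences[i] for i in chosen]


# Toy corpus, invented for illustration only.
corpus = [
    "le chat dort".split(),
    "le chat mange".split(),
    "le chien dort".split(),
    "une souris traverse la cuisine".split(),
    "le chat dort encore".split(),
]

sample = greedy_diverse_sample(corpus, 2)
random_sample = random.Random(0).sample(corpus, 2)  # random baseline

print("greedy richness:", richness(sample))
print("random richness:", richness(random_sample))
```

The greedy loop evaluates every remaining sentence at each step, so it costs roughly k passes over the corpus, whereas an exhaustive search would have to score every possible subset of size k. A greedy choice of this kind is not guaranteed to be optimal in general. The same loop would apply to syntactic diversity if the token sets were replaced by sets of syntactic units, at the cost of parsing every sentence first.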