CL · Mar 26

Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian

arXiv:2603.25227 · 78.1 · h-index: 6

AI Analysis

This paper addresses the problem of data quality in the linguistic evaluation of LLMs. It shows that models trained on natural data capture abstract patterns better than models trained on synthetic data. The findings are incremental, since they build on existing structured evaluation methods.

The study compared the impact of natural versus synthetic data on training and evaluating large language models for passive verb alternation in French and Italian, finding that models trained on natural data achieved robust performance across both data types, while those trained on synthetic data failed to generalize to natural sentences.

This study compares the impact of natural and synthetic data on training and evaluating large language models (LLMs), using the case of passive verb alternation in French and Italian. We use Blackbird Language Matrices (BLMs), structured datasets designed to probe linguistic knowledge of underlying patterns across sentence sets. We compare structured templates instantiated with natural sentences extracted from Universal Dependencies to structured templates of synthetic sentences. Experiments show that while models achieve ceiling performance when trained and tested on synthetic datasets, they do not reliably generalize to natural sentences. In contrast, models trained on natural data exhibit robust performance across both natural and synthetic test suites, demonstrating their superior ability to capture abstract linguistic patterns. These results corroborate the value of natural data and of structured setups in linguistic evaluation for probing LLMs' syntactic and semantic knowledge.
