AI, IR | May 22

When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

Amirhossein Yousefiramandi, Ciaran Cooney

arXiv:2605.24296 | 24.5 | Has Code

AI Analysis

For practitioners in patent classification and low-resource NLP, this work provides controlled evidence that synthetic data gains are often overestimated due to volume confounds, and offers practical guidelines for mixing real and synthetic data.

The paper investigates when LLM-generated synthetic data helps low-resource multi-label patent classification, finding that the apparent micro-F1 jump from 0.120 to 0.702 is largely volume-driven, with a controlled synthetic gain of only +0.024 over a duplicate-to-match real-only control. The study reveals that fidelity metrics change meaning with scale and that synthetic data benefits are task- and metric-specific.
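The duplicate-to-match control mentioned above can be illustrated with a minimal sketch: upsample the real documents by duplication until the set is as large as the augmented set, so that any remaining gap can be attributed to the synthetic text and not to volume. The function name `duplicate_to_match`, the placeholder document IDs, and the target size of 1000 are assumptions for illustration; the paper's actual resampling procedure may differ.

```python
import random


def duplicate_to_match(real_docs, target_size, seed=0):
    """Upsample real documents by duplication to a given size.

    Every original document is kept once; the remainder is filled by
    sampling with replacement from the same real documents.
    """
    if not real_docs:
        raise ValueError("real_docs must be non-empty")
    if target_size < len(real_docs):
        raise ValueError("target_size must be >= number of real documents")
    rng = random.Random(seed)
    extra = rng.choices(real_docs, k=target_size - len(real_docs))
    matched = list(real_docs) + extra
    rng.shuffle(matched)
    return matched


# Hypothetical example: 165 real patents, augmented set assumed to hold 1000 items.
real = [f"patent_{i}" for i in range(165)]
control = duplicate_to_match(real, target_size=1000)
```

Training the same classifier on `control` and on the real-plus-synthetic set of equal size gives a comparison in which both arms see the same number of training examples.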

We study when LLM-generated synthetic data helps low-resource multi-label patent classification, separating true synthetic value from the confound that larger augmented sets can win by volume alone. Across six open-source LLMs (3.8-12B), four real-data regimes, 64 WIPO assistive-technology labels, two generation strategies, and three classifier families, the headline BERT-for-Patents micro-F1 jump from 0.120 to 0.702 is largely volume-driven. A duplicate-to-match real-only control that resamples 165 patents to the augmented size reaches 0.678; the controlled synthetic gain is only +0.024 over this control, but +0.219 over focal-loss reweighting, the strongest non-augmentation baseline. The main finding is that fidelity metrics change meaning with scale: at extreme scarcity, MMD correlates positively with classification gain (r=+0.95), but at 1:10 the relation flips (r=-0.73; Fisher z=+6.47, p<0.001). Fixed-budget mixing finds a 20-30% real / 70-80% synthetic optimum; paraphrase scaling collapses from a 165-document seed; and shuffled mixing beats curriculum ordering, ensembling, and classifier-based filtering. Leakage controls -- label-name masking, instruction-level label removal, fine-grained evaluation, and keyword-overlap audits -- argue against label-string dependence as the main driver for BERT-for-Patents. The apparent ModernBERT collapse under label removal is traced to a Flash-Attention-2 + bf16 numerical artifact, recovering 65% of lost performance with fp32 eager attention. Finally, the same corpus that improves classification by up to +0.58 raw micro-F1 hurts a Jaccard-label-overlap retrieval proxy; even a standard-patent-only filter leaves a 26% nDCG@10 drop. Thus, synthetic patent text is task- and metric-specific, not reducible to prompt genre alone.
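The sign flip in the MMD-gain correlation is reported with a Fisher z statistic. A minimal sketch of one common form of that test, for two correlations from independent samples, is shown here. The abstract does not state the sample sizes or the exact test variant, so the function `fisher_z_diff` and the sample sizes of 20 are assumptions for illustration and are not expected to reproduce the reported z value.

```python
import math


def fisher_z_diff(r1, n1, r2, n2):
    """Compare two Pearson correlations from independent samples.

    Returns the z statistic and the two-sided p-value under a
    standard normal reference distribution.
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample needs more than 3 observations")
    z1 = math.atanh(r1)  # Fisher transform of r1
    z2 = math.atanh(r2)  # Fisher transform of r2
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return z, p


# Correlations from the abstract; n = 20 per regime is a hypothetical placeholder.
z, p = fisher_z_diff(0.95, 20, -0.73, 20)
```

A positive `z` here means the first correlation is larger than the second; the magnitude depends strongly on the sample sizes, which is why they need to come from the paper itself.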
