Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties
This work addresses a domain-specific problem in semantic relation extraction for researchers and practitioners using SLMs with SHACL shapes, offering incremental improvements through practical training strategies.
The paper tackles the difficulty that small language models (SLMs) have with the long-tail distribution of rare properties when extracting both datatype and object properties for complete RDF graph extraction. It finds that the most balanced performance comes from building a training set in which the number of occurrences of each property exceeds a given threshold.
Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for complete RDF graph extraction. We show that the key bottleneck is the long-tail distribution of rare properties. To address this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy for performing equally well across unbalanced target properties is to build a training set in which the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we have publicly released our datasets, experimental results, and code. Our findings offer practical guidance for training shape-aware SLMs and highlight promising directions for future work in semantic RE.
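As an illustration of the threshold idea, the sketch below greedily selects examples from a pool until every property has been seen at least a given number of times, visiting examples that carry rare properties first. It is a minimal sketch under stated assumptions and not the authors' released code: the function name `select_with_min_support`, the `properties` field, and the sample property names are hypothetical.

```python
from collections import Counter


def select_with_min_support(examples, threshold):
    """Greedily pick examples until each property occurs at least
    `threshold` times, as far as the pool allows.

    `examples` is a list of dicts with a hypothetical "properties" field
    listing the properties annotated in that example.
    Returns (selected examples, per-property counts, unmet shortfall).
    """
    # How often each property appears in the whole pool.
    available = Counter(p for ex in examples for p in set(ex["properties"]))

    # Visit examples whose rarest property is scarcest first, so that
    # long-tail properties are covered before common ones fill the set.
    def rarity(ex):
        return min((available[p] for p in ex["properties"]), default=float("inf"))

    counts = Counter()
    selected = []
    for ex in sorted(examples, key=rarity):
        props = set(ex["properties"])
        # Keep the example only if it helps a property still below threshold.
        if any(counts[p] < threshold for p in props):
            selected.append(ex)
            counts.update(props)

    # Properties the pool could not bring up to the threshold.
    shortfall = {p: threshold - counts[p] for p in available if counts[p] < threshold}
    return selected, counts, shortfall


# Hypothetical pool: one common and one rare property.
pool = [
    {"text": "a", "properties": ["birthDate"]},
    {"text": "b", "properties": ["birthDate"]},
    {"text": "c", "properties": ["birthDate"]},
    {"text": "d", "properties": ["birthDate", "spouse"]},
    {"text": "e", "properties": ["spouse"]},
]
selected, counts, shortfall = select_with_min_support(pool, threshold=2)
```

A non-empty `shortfall` signals that the pool itself lacks enough examples of some property, which is the situation where additional data collection or synthetic augmentation would be needed to reach the threshold.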