CL · May 22, 2025

Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation

arXiv:2505.16800v1 · 2.7 · h-index: 5

Originality: Incremental advance

AI Analysis

This paper addresses data scarcity in morphological analysis for low-resource languages, but the contribution appears incremental: it applies existing techniques (multitask learning and synthetic data) to a specific domain.

The paper tackles morpheme segmentation for low-resource languages by combining multitask learning with LLM-generated synthetic data, achieving significant improvements in word-level accuracy and morpheme-level F1-score on the SIGMORPHON 2023 dataset.

We introduce a transformer-based morpheme segmentation system that augments a low-resource training signal through multitask learning and LLM-generated synthetic data. Our framework jointly predicts morphological segments and glosses from orthographic input, leveraging shared linguistic representations obtained through a common documentary process to enhance model generalization. To further address data scarcity, we integrate synthetic training data generated by large language models (LLMs) using in-context learning. Experimental results on the SIGMORPHON 2023 dataset show that our approach significantly improves word-level segmentation accuracy and morpheme-level F1-score across multiple low-resource languages.
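The two metrics named in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: segmentations are strings with morphemes joined by "-", word-level accuracy means an exact match of the whole segmented word, and morpheme-level F1 is micro-averaged over per-word multiset overlap of morphemes. The function names and the toy English data are hypothetical, and the official SIGMORPHON 2023 scoring may differ in its details.

```python
from collections import Counter


def word_accuracy(gold, pred):
    """Fraction of words whose predicted segmentation equals the gold one exactly."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def morpheme_f1(gold, pred, sep="-"):
    """Micro-averaged F1 over morphemes (assumed definition, see lead-in).

    For each word, morphemes are compared as multisets, so order inside
    a word is ignored but repeated morphemes are counted.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_counts = Counter(g.split(sep))
        p_counts = Counter(p.split(sep))
        # Multiset intersection counts morphemes present in both analyses.
        true_pos += sum((g_counts & p_counts).values())
        n_gold += sum(g_counts.values())
        n_pred += sum(p_counts.values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
gold = ["walk-ed", "un-happy-ness", "cat-s"]
pred = ["walk-ed", "un-happiness", "cat-s"]
print(word_accuracy(gold, pred))
print(morpheme_f1(gold, pred))
```

In this toy case one of three words is segmented incorrectly, yet most of its morphemes are not lost to the other two words, so the morpheme-level score is higher than the word-level one. That gap is why the two metrics are usually reported together.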
