CL | Mar 18, 2025

Second language Korean Universal Dependency treebank v1.2: Focus on data augmentation and annotation scheme refinement

arXiv:2503.14718v1 | 18.213 citations | h-index: 5 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses morphosyntactic analysis of second language (L2) Korean learner data by providing an enhanced dataset. The contribution is incremental, since it builds on an existing treebank and existing fine-tuning methods.

The authors expanded a second language Korean Universal Dependencies treebank by adding 5,454 manually annotated sentences and refined the annotation guidelines. Fine-tuning three Korean language models on this dataset significantly improved their performance on in-domain and out-of-domain L2-Korean datasets.

We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. The annotation guidelines are also revised to better align with the UD framework. Using this enhanced treebank, we fine-tune three Korean language models and evaluate their performance on in-domain and out-of-domain L2-Korean datasets. The results show that fine-tuning significantly improves their performance across various metrics, thus highlighting the importance of using well-tailored L2 datasets for fine-tuning first-language-based, general-purpose language models for the morphosyntactic analysis of L2 data.
