CLJan 30

From Labels to Facets: Building a Taxonomically Enriched Turkish Learner Corpus

Elif Sayar, Tolgahan Türker, Anna Golynskaia Knezhevich, Bihter Dereli, Ayşe Demirhas, Lionel Nicolas, Gülşen Eryiğit

arXiv:2601.22875v20.6h-index: 2

Originality Synthesis-oriented

AI Analysis

This work addresses the problem of coarse-grained error analysis in learner corpora for linguistics and language education researchers, representing an incremental improvement through application of an existing taxonomy to Turkish data.

The paper tackles the limitation of flat label annotations in learner corpora by developing a semi-automated annotation methodology based on a faceted taxonomy, achieving 95.86% facet-level accuracy in extending Turkish learner corpus annotations to enable fine-grained linguistic analysis.

In terms of annotation structure, most learner corpora rely on holistic flat label inventories which, even when extensive, do not explicitly separate multiple linguistic dimensions. This makes linguistically deep annotation difficult and complicates fine-grained analyses aimed at understanding why and how learners produce specific errors. To address these limitations, this paper presents a semi-automated annotation methodology for learner corpora, built upon a recently proposed faceted taxonomy, and implemented through a novel annotation extension framework. The taxonomy provides a theoretically grounded, multi-dimensional categorization that captures the linguistic properties underlying each error instance, thereby enabling standardized, fine-grained, and interpretable enrichment beyond flat annotations. The annotation extension tool, implemented based on the proposed extension framework for Turkish, automatically extends existing flat annotations by inferring additional linguistic and metadata information as facets within the taxonomy to provide richer learner-specific context. It was systematically evaluated and yielded promising performance results, achieving a facet-level accuracy of 95.86%. The resulting taxonomically enriched corpus offers enhanced querying capabilities and supports detailed exploratory analyses across learner corpora, enabling researchers to investigate error patterns through complex linguistic and pedagogical dimensions. This work introduces the first collaboratively annotated and taxonomically enriched Turkish Learner Corpus, a manual annotation guideline with a refined tagset, and an annotation extender. As the first corpus designed in accordance with the recently introduced taxonomy, we expect our study to pave the way for subsequent enrichment efforts of existing error-annotated learner corpora.

View on arXiv PDF

Similar