CL, Aug 8, 2023

CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

arXiv:2308.04255v2, 26 citations, h-index: 32
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better NLP tools for South Slavic languages. Its contribution is incremental, since it builds on the existing Stanza pipeline.

The authors address automatic linguistic annotation of South Slavic languages with CLASSLA-Stanza, a pipeline built on Stanza. It performs consistently well across all supported languages and outperforms or extends Stanza on every supported task.

We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline. We describe the main improvements in CLASSLA-Stanza with respect to Stanza, and give a detailed description of the model training process for the latest 2.1 release of the pipeline. We also report performance scores produced by the pipeline for different languages and varieties. CLASSLA-Stanza exhibits consistently high performance across all the supported languages and outperforms or expands its parent pipeline Stanza at all the supported tasks. We also present the pipeline's new functionality enabling efficient processing of web data and the reasons that led to its implementation.

Code Implementations: 1 repo
