TurkicNLP: An NLP Toolkit for Turkic Languages
This addresses the lack of unified NLP tooling for Turkic languages, benefiting researchers and developers working with these languages, though it is incremental as it integrates existing methods into a new framework.
The authors tackled the problem of fragmented NLP resources for Turkic languages by developing TurkicNLP, an open-source Python library that provides a unified NLP pipeline covering tokenization, morphological analysis, and machine translation across four script families, resulting in a single, consistent tool for over 200 million speakers.
Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp .