CL · Sep 23, 2023

GlotScript: A Resource and Tool for Low Resource Writing System Identification

arXiv:2309.13320v2 · 89 citations · h-index: 70 · Has Code
Originality: Synthesis-oriented
AI Analysis

GlotScript gives NLP researchers working on low-resource languages a practical tool, though the contribution is incremental because it aggregates existing resources.

The authors tackled the problem of identifying writing systems for low-resource languages by introducing GlotScript, a resource and tool that covers over 7,000 languages and 161 Unicode scripts, and demonstrated its utility in cleaning multilingual corpora and analyzing language model tokenization.

We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns its script distribution where scripts are identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help cleaning multilingual corpora such as mC4 and OSCAR. Second, we analyze the tokenization of a number of language models such as GPT-4 using GlotScript and provide insights on the coverage of low resource scripts and languages by each language model. We hope that GlotScript will become a useful resource for work on low resource languages in the NLP community. GlotScript-R and GlotScript-T are available at https://github.com/cisnlp/GlotScript.
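The idea of a script distribution, and of using it to filter corpus lines, can be illustrated with a minimal sketch. This is not the GlotScript-T API: the names `SCRIPT_RANGES`, `script_distribution`, and `keep_line` are hypothetical, the range table is a tiny hand-picked subset of Unicode (the real tool covers all 161 scripts), and the 0.5 threshold is an arbitrary assumption.

```python
from collections import Counter

# Hypothetical, heavily truncated table: (first code point, last code point, ISO 15924 code).
# Only a few core letter ranges are listed; a real tool needs the full Unicode script data.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x0391, 0x03A1, "Grek"),  # Greek capital Alpha-Rho
    (0x03A3, 0x03A9, "Grek"),  # Greek capital Sigma-Omega
    (0x03B1, 0x03C9, "Grek"),  # Greek small alpha-omega
    (0x0410, 0x044F, "Cyrl"),  # basic Cyrillic letters
    (0x0621, 0x063A, "Arab"),  # core Arabic letters
    (0x0641, 0x064A, "Arab"),  # core Arabic letters (continued)
    (0x0904, 0x0939, "Deva"),  # Devanagari vowels and consonants
]


def script_distribution(text):
    """Return {ISO 15924 code: share of letters} for the letters in text."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation, and whitespace carry no script signal here
        # Letters outside the toy table are labelled "Zzzz" (uncoded script).
        code = next(
            (c for lo, hi, c in SCRIPT_RANGES if lo <= ord(ch) <= hi), "Zzzz"
        )
        counts[code] += 1
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


def keep_line(text, expected_scripts, min_share=0.5):
    """Keep a corpus line only if enough of its letters are in an expected script."""
    dist = script_distribution(text)
    share = sum(dist.get(code, 0.0) for code in expected_scripts)
    return share >= min_share


print(script_distribution("Hello мир"))
print(keep_line("Привет", {"Cyrl"}), keep_line("Hello world", {"Cyrl"}))
```

In a cleaning pipeline of the kind the abstract describes, the set passed as `expected_scripts` would come from a resource such as GlotScript-R, which lists the attested writing systems per language; here it is supplied by hand.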

Code Implementations: 1 repo
