CL, CV | Sep 2, 2024

Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts

Tsinghua
arXiv:2409.01011v1 | 1 citation | h-index: 44
Originality: Incremental advance
AI Analysis

This work addresses the challenge of tokenizing complex ancient Chinese scripts for the academic community, though it is incremental as it builds on existing tokenization methods.

This study tackled the problem of analyzing ancient Chinese Chu bamboo slip scripts by developing a multi-modal multi-granularity tokenizer, resulting in a 5.5% relative improvement in F1-score on part-of-speech tagging compared to mainstream sub-word tokenizers.
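The 5.5% figure is a relative improvement, which is easy to confuse with an absolute gain in F1 points. A minimal sketch of the distinction follows; the function name `relative_improvement` is hypothetical, and the summary does not report the underlying baseline score.

```python
def relative_improvement(baseline_f1: float, new_f1: float) -> float:
    """Return the gain of new_f1 over baseline_f1 as a fraction of the baseline."""
    if baseline_f1 <= 0:
        raise ValueError("baseline_f1 must be positive")
    # Relative gain divides the absolute difference by the baseline score.
    return (new_f1 - baseline_f1) / baseline_f1
```

As an illustration only (these numbers are not from the paper), a baseline F1 of 0.800 rising to 0.844 is a 5.5% relative improvement but a gain of 4.4 absolute points.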

This study presents a multi-modal multi-granularity tokenizer specifically designed for analyzing ancient Chinese scripts, focusing on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China. Considering the complex hierarchical structure of ancient Chinese scripts, where a single character may be a combination of multiple sub-characters, our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels. Moreover, to support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans. On the part-of-speech tagging task built on our dataset, using our tokenizer gives a 5.5% relative improvement in F1-score compared to mainstream sub-word tokenizers. Our work not only aids in further investigations of the specific script but also has the potential to advance research on other forms of ancient Chinese scripts.
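The pipeline described above (detect characters, then recognize each at the character and sub-character levels) could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the detection and recognition models are assumed to have already run, the names `Recognition`, `DetectedChar`, and `tokenize` are hypothetical, and the confidence-threshold fallback rule is an assumption that the abstract does not confirm.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Recognition:
    label: str         # predicted character or sub-character label
    confidence: float  # recognizer score in [0, 1]


@dataclass
class DetectedChar:
    # One character region found by the detection step (hypothetical structure).
    char: Recognition                                             # character-level prediction
    components: List[Recognition] = field(default_factory=list)  # sub-character predictions


def tokenize(chars: List[DetectedChar], threshold: float = 0.5) -> List[str]:
    """Emit one token per confident character, else its sub-character components.

    The threshold rule is an assumed fallback strategy for illustration.
    """
    tokens: List[str] = []
    for c in chars:
        if c.char.confidence >= threshold or not c.components:
            # Confident prediction, or nothing finer to fall back on.
            tokens.append(c.char.label)
        else:
            # Low confidence: back off to the finer sub-character granularity.
            tokens.extend(part.label for part in c.components)
    return tokens
```

With placeholder labels, a confidently recognized character yields a single token, while an uncertain one is represented by its component labels, so a rare or unseen character can still contribute structural information downstream.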

Code Implementations: 1 repo
