CL | AI | May 24, 2022

PERT: A New Solution to Pinyin to Character Conversion Task

arXiv:2205.11737v1 | 2 citations | h-index: 33
Originality: Incremental advance
AI Analysis

This work addresses a key problem in commercial input software for Asian languages, offering incremental improvements over existing methods.

The paper tackles the Pinyin to Character conversion task for Asian-language input methods by proposing PERT, a bidirectional Transformer-based model. PERT achieves significant performance improvements over n-gram and RNN baselines, with further gains from combining it with an n-gram model and from incorporating an external lexicon to address out-of-distribution issues.

The Pinyin to Character conversion (P2C) task is the key task of the Input Method Engine (IME) in commercial input software for Asian languages such as Chinese, Japanese, and Thai. It is usually treated as a sequence labelling task and solved with a language model, i.e. an n-gram model or an RNN. However, the low capacity of n-gram and RNN models limits their performance. This paper introduces a new solution named PERT, which stands for bidirectional Pinyin Encoder Representations from Transformers. It achieves a significant performance improvement over the baselines. Furthermore, we combine PERT with an n-gram model under a Markov framework, which improves performance further. Lastly, an external lexicon is incorporated into PERT to resolve the OOD issue of IMEs.
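To make the sequence-labelling framing concrete, the following is a minimal sketch of the kind of n-gram baseline the abstract refers to: each pinyin syllable has a set of candidate characters, and a bigram language model with Viterbi search picks the most likely character sequence. It is not the paper's implementation. The candidate table, the probabilities, and the names `CANDIDATES`, `BIGRAM_LOGP`, and `p2c_viterbi` are all hypothetical and invented for illustration.

```python
import math

# Hypothetical toy data, invented for illustration only (not from the paper).
# Each pinyin syllable maps to the characters it may be converted to.
CANDIDATES = {
    "ni": ["你", "泥"],
    "hao": ["好", "号"],
}

# Toy bigram log-probabilities log P(cur | prev); "<s>" marks sentence start.
BIGRAM_LOGP = {
    ("<s>", "你"): math.log(0.6),
    ("<s>", "泥"): math.log(0.1),
    ("你", "好"): math.log(0.7),
    ("你", "号"): math.log(0.05),
    ("泥", "好"): math.log(0.1),
    ("泥", "号"): math.log(0.1),
}
UNSEEN_LOGP = math.log(1e-6)  # crude floor for bigrams missing from the table


def bigram_logp(prev, cur):
    """Look up log P(cur | prev), falling back to a small floor value."""
    return BIGRAM_LOGP.get((prev, cur), UNSEEN_LOGP)


def p2c_viterbi(syllables):
    """Return the highest-scoring character string for a list of syllables."""
    # best[ch] = (log-score, path) of the best partial path ending in ch.
    best = {"<s>": (0.0, [])}
    for syl in syllables:
        nxt = {}
        for ch in CANDIDATES[syl]:
            # Choose the best predecessor for this candidate character.
            score, path = max(
                ((s + bigram_logp(prev, ch), p) for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            nxt[ch] = (score, path + [ch])
        best = nxt
    return "".join(max(best.values(), key=lambda t: t[0])[1])


print(p2c_viterbi(["ni", "hao"]))
```

With this toy table, the decoder prefers 你好 for the input `["ni", "hao"]`. A bigram model like this conditions only on the previous character, which is one way to read the limited capacity the abstract attributes to n-gram approaches.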

Code Implementations: 1 repo