CL · Jul 20, 2017

A Sub-Character Architecture for Korean Language Processing

arXiv:1707.06341v2 · 1093 citations
AI Analysis

This paper addresses data sparsity and accuracy issues in Korean NLP, though its scope is specific to Korean language processing.

The authors introduce a sub-character architecture for Korean language processing that decomposes each character into jamo letters. In their experiments this reduces the observation space to 1.6% of the original and yields a dramatic improvement in dependency parsing accuracy over strong lexical baselines.

We introduce a novel sub-character architecture that exploits a unique compositional structure of the Korean language. Our method decomposes each character into a small set of primitive phonetic units called jamo letters from which character- and word-level representations are induced. The jamo letters divulge syntactic and semantic information that is difficult to access with conventional character-level units. They greatly alleviate the data sparsity problem, reducing the observation space to 1.6% of the original while increasing accuracy in our experiments. We apply our architecture to dependency parsing and achieve dramatic improvement over strong lexical baselines.
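The decomposition step described in the abstract can be illustrated with the standard Unicode arithmetic for precomposed Hangul syllables. This is a minimal sketch, not the authors' implementation: the function names `decompose_syllable` and `to_jamo` are hypothetical, and the sketch covers only the mapping from characters to jamo, not the learned character- and word-level representations.

```python
# Standard Unicode Hangul syllable decomposition constants.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes the "no final consonant" slot
N_COUNT = V_COUNT * T_COUNT             # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT             # 11172 precomposed syllables in total


def decompose_syllable(ch):
    """Return the jamo of one precomposed Hangul syllable as a tuple.

    Characters outside the precomposed syllable block are returned unchanged.
    (Hypothetical helper for illustration.)
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return (ch,)
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    if t_index == 0:
        # No final consonant.
        return (lead, vowel)
    return (lead, vowel, chr(T_BASE + t_index))


def to_jamo(text):
    """Flatten a string into a list of jamo (non-Hangul characters pass through)."""
    return [jamo for ch in text for jamo in decompose_syllable(ch)]


if __name__ == "__main__":
    print(to_jamo("한글"))
```

In this Unicode scheme, 11,172 possible syllable blocks are built from 19 leading consonants, 21 vowels, and 27 final consonants, which gives a sense of why jamo-level units are far less sparse than character-level ones. The 1.6% figure in the abstract comes from the authors' own experiments and is not reproduced by this arithmetic.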
