CL · LG · Aug 31, 2017

Glyph-aware Embedding of Chinese Characters

arXiv:1709.00028v1 · 1095 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of modeling logographic scripts such as Chinese, a domain-specific improvement aimed at NLP practitioners.

The paper tackles the problem of representing Chinese characters for NLP tasks by incorporating visual glyph information into character embeddings, and reports improved performance over baseline methods in language modeling and word segmentation.

Given the advantages and recent success of English character-level and subword-unit models in several NLP tasks, we consider the equivalent modeling problem for Chinese. Chinese script is logographic, and many Chinese logograms are composed of common substructures that provide semantic, phonetic and syntactic hints. In this work, we propose to explicitly incorporate the visual appearance of a character's glyph in its representation, resulting in a novel glyph-aware embedding of Chinese characters. Inspired by the success of convolutional neural networks in computer vision, we use them to incorporate the spatio-structural patterns of Chinese glyphs as rendered in raw pixels. In the context of two basic Chinese NLP tasks, language modeling and word segmentation, the model learns to represent each character's task-relevant semantic and syntactic information in the character-level embedding.
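The idea in the abstract, mapping a glyph's pixels through a convolutional encoder to a character embedding, can be illustrated with a minimal NumPy sketch. This is not the paper's architecture. Everything here is an assumption made for illustration: the 7x7 hand-drawn bitmaps stand in for glyphs rendered from a font, the two fixed stroke-detector kernels stand in for filters that would be learned from the task loss, and the names `GlyphEncoder`, `conv2d_valid`, `pooled_features` and `to_bitmap` are hypothetical.

```python
import numpy as np

# Hand-drawn 7x7 bitmaps standing in for rendered glyphs (assumption: a real
# pipeline would rasterize each character from a font at a higher resolution).
GLYPHS = {
    "一": ["0000000", "0000000", "0000000", "1111111",
           "0000000", "0000000", "0000000"],
    "十": ["0001000", "0001000", "0001000", "1111111",
           "0001000", "0001000", "0001000"],
}


def to_bitmap(rows):
    """Convert rows of '0'/'1' characters into a float32 pixel array."""
    return np.array([[int(c) for c in row] for row in rows], dtype=np.float32)


def conv2d_valid(image, kernels):
    """Cross-correlate image (H, W) with kernels (K, kh, kw); stride 1, no padding."""
    k, kh, kw = kernels.shape
    h, w = image.shape
    out = np.empty((k, h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patch = image[i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return out


class GlyphEncoder:
    """Toy glyph encoder: conv -> ReLU -> global max pool -> linear projection."""

    def __init__(self, embed_dim=8, seed=0):
        # Fixed stroke detectors (assumption: stand-ins for learned filters).
        horizontal = np.array([[-1, -1, -1], [2, 2, 2], [-1, -1, -1]], dtype=np.float32)
        self.kernels = np.stack([horizontal, horizontal.T])
        # Random projection (assumption: would be trained jointly with the task).
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((len(self.kernels), embed_dim)).astype(np.float32)

    def pooled_features(self, bitmap):
        feature_maps = np.maximum(conv2d_valid(bitmap, self.kernels), 0.0)  # ReLU
        return feature_maps.max(axis=(1, 2))  # strongest response per kernel

    def __call__(self, bitmap):
        return self.pooled_features(bitmap) @ self.proj


encoder = GlyphEncoder()
embeddings = {ch: encoder(to_bitmap(rows)) for ch, rows in GLYPHS.items()}
```

In this toy setup the pooled features are [6, 0] for 一 (a horizontal stroke only) and [6, 6] for 十 (horizontal plus vertical strokes), so the two characters share one feature and differ in the other. That is the kind of shared-substructure signal the abstract describes, here produced by hand-set filters rather than learned ones.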

Code Implementations: 1 repo
