Logographic Information Aids Learning Better Representations for Natural Language Inference
This paper addresses the challenge of learning better semantic representations in data-scarce scenarios for languages such as Chinese and Vietnamese, though the contribution is incremental because it builds on existing multi-modal approaches.
The paper tackles the problem that language models ignore the logographic features of written text, a gap that matters most under data sparsity. It explores multi-modal representations that combine contextual and glyph information for natural language inference. An evaluation across six languages shows significant benefits in languages with logographic writing systems, particularly for low-frequency words.
Statistical language models conventionally implement representation learning based on the contextual distribution of words or other formal units, whereas any information related to the logographic features of written text is often ignored, on the assumption that it can be recovered from co-occurrence statistics. However, as language models become larger and require more data to learn reliable representations, such assumptions may start to break down, especially under conditions of data sparsity. Many languages, including Chinese and Vietnamese, use logographic writing systems where surface forms are represented as a visual organization of smaller graphemic units, which often contain many semantic cues. In this paper, we present a novel study that explores the benefits of providing language models with logographic information in learning better semantic representations. We test our hypothesis on the natural language inference (NLI) task by evaluating the benefit of computing multi-modal representations that combine contextual information with glyph information. Our evaluation results in six languages with different typologies and writing systems suggest significant benefits of using multi-modal embeddings in languages with logographic systems, especially for words that occur less frequently.
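To make the idea of a multi-modal representation concrete, the following is a minimal sketch of one possible way to combine a contextual embedding with glyph features. The abstract does not specify how the two modalities are fused; plain concatenation is an assumption made here for illustration only. The names `glyph_features` and `fuse`, the toy 3x3 bitmap, and the toy contextual vector are all hypothetical and do not come from the paper.

```python
from typing import List

Vector = List[float]


def glyph_features(bitmap: List[List[int]]) -> Vector:
    """Flatten a binary glyph bitmap into a feature vector.

    Hypothetical stand-in for a learned glyph encoder (an assumption,
    not the paper's method).
    """
    return [float(pixel) for row in bitmap for pixel in row]


def fuse(contextual: Vector, glyph: Vector) -> Vector:
    """Concatenate contextual and glyph features into one vector.

    Concatenation is an assumed fusion strategy chosen for simplicity.
    """
    return contextual + glyph


# Toy 3x3 rendering of a character (illustrative values only).
toy_bitmap = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]

# Toy contextual embedding, as a language model might produce (made up).
toy_contextual = [0.2, -0.5, 0.1, 0.7]

multimodal = fuse(toy_contextual, glyph_features(toy_bitmap))
print(len(multimodal))  # 4 contextual dims + 9 glyph dims = 13
```

Under this sketch, a rare word whose contextual vector is poorly estimated still carries the same glyph dimensions as any other word, which is one intuition for why glyph information might help most for low-frequency words.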