CL, Apr 15, 2022

Chinese Idiom Paraphrasing

arXiv:2204.07555v2131 citations, h-index: 49
Originality: Synthesis-oriented
AI Analysis

This work addresses a domain-specific problem for Chinese NLP systems, such as machine translation and idiom cloze, by making dataset pre-processing easier. It is incremental, however, because it adapts existing paraphrase generation techniques.

Chinese idioms are hard to understand because they are non-compositional and metaphorical. This study proposes the Chinese Idiom Paraphrasing (CIP) task, which rephrases sentences containing idioms into non-idiomatic ones while preserving their meaning. It also establishes a dataset of 115,530 sentence pairs and reports that the proposed methods outperform the baselines on it.

Idioms are a kind of idiomatic expression in Chinese, most of which consist of four Chinese characters. Because they are non-compositional and carry metaphorical meaning, Chinese idioms are hard for children and non-native speakers to understand. This study proposes a novel task, denoted Chinese Idiom Paraphrasing (CIP). CIP aims to rephrase sentences that contain idioms into non-idiomatic ones while preserving the original sentence's meaning. Since sentences without idioms are more easily handled by Chinese NLP systems, CIP can be used to pre-process Chinese datasets, thereby improving performance on Chinese NLP tasks such as machine translation, Chinese idiom cloze, and Chinese idiom embeddings. In this study, the CIP task is treated as a special paraphrase generation task. To circumvent difficulties in acquiring annotations, we first establish a large-scale CIP dataset through human and machine collaboration, which consists of 115,530 sentence pairs. We further deploy three baselines and two novel CIP approaches to address the CIP problem. The results show that the proposed methods outperform the baselines on the established CIP dataset.
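To make the input and output of the CIP task concrete, here is a minimal Python sketch. It is not one of the approaches from the paper, which treats CIP as paraphrase generation. It only substitutes idioms using a small hand-written glossary, and the names `IDIOM_GLOSSARY`, `paraphrase_idioms`, and `contains_idiom`, as well as the glossary entries, are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: a glossary lookup, not the paper's generation models.
# The glossary entries below are hypothetical examples, not taken from the CIP dataset.
IDIOM_GLOSSARY = {
    "画蛇添足": "做多余的事",            # "draw legs on a snake": do something superfluous
    "一举两得": "做一件事得到两种好处",  # "one move, two gains"
}


def contains_idiom(sentence, glossary=IDIOM_GLOSSARY):
    """Return True if the sentence contains any idiom listed in the glossary."""
    return any(idiom in sentence for idiom in glossary)


def paraphrase_idioms(sentence, glossary=IDIOM_GLOSSARY):
    """Replace each known idiom with its plain-language paraphrase.

    Longer idioms are substituted first so that a shorter entry cannot
    break up a longer one that contains it.
    """
    for idiom in sorted(glossary, key=len, reverse=True):
        sentence = sentence.replace(idiom, glossary[idiom])
    return sentence


source = "这样做是画蛇添足。"
target = paraphrase_idioms(source)
print(source, "->", target)
```

A plain substitution like this ignores context and often yields unnatural sentences, which is one reason the task is framed as generation over whole sentences rather than as lookup.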

Code Implementations: 1 repo