CL | Nov 24, 2022

InDEX: Indonesian Idiom and Expression Dataset for Cloze Test

arXiv:2211.13376v1 | 0.3 | h-index: 7

Originality: Synthesis-oriented

AI Analysis

This work addresses a domain-specific need for better Indonesian NLP tools, but it is incremental: it builds on existing cloze-test methods, adding a new dataset and minor model adjustments.

The authors tackle cloze-test reading comprehension for Indonesian idioms and expressions by creating the InDEX dataset, with 10,438 unique sentences covering 289 idioms and expressions. They find that combining definitions with random initialization improves model performance for idioms, while static embeddings with random initialization suffice for fixed expressions.

We propose InDEX, an Indonesian Idiom and Expression dataset for cloze tests. The dataset contains 10,438 unique sentences for 289 idioms and expressions, for which we generate 15 different types of distractors, resulting in a large cloze-style corpus. Many baseline models for cloze-test reading comprehension apply BERT with random initialization to learn embedding representations. But idioms and fixed expressions are different in that the literal meaning of a phrase may or may not be consistent with its contextual meaning. We therefore explore different ways to combine static and contextual representations for a stronger baseline model. Experiments show that combining definitions with random initialization better supports cloze-test model performance for idioms, whether evaluated independently or mixed with fixed expressions. For fixed expressions with no special meaning, static embeddings with random initialization are sufficient for a cloze-test model.
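
To make the cloze-style format concrete, here is a minimal sketch of how one item might be assembled: the idiom is blanked out of its sentence and offered alongside distractors. Everything in it is an assumption for illustration. The names `ClozeItem` and `make_cloze_item`, the `[BLANK]` marker, and the example sentence are hypothetical and not taken from the dataset, and uniform random sampling merely stands in for the 15 distractor types, which the abstract does not specify.

```python
import random
from dataclasses import dataclass


@dataclass
class ClozeItem:
    context: str        # sentence with the idiom replaced by a blank
    options: list       # shuffled candidates: the answer plus distractors
    answer_index: int   # position of the correct idiom in `options`


def make_cloze_item(sentence, idiom, candidate_pool,
                    n_distractors=3, blank="[BLANK]", seed=0):
    """Build one cloze item (hypothetical helper, not the authors' pipeline)."""
    if idiom not in sentence:
        raise ValueError("idiom does not occur in sentence")
    rng = random.Random(seed)
    # Assumption: distractors are sampled uniformly from the other phrases.
    pool = [c for c in candidate_pool if c != idiom]
    distractors = rng.sample(pool, n_distractors)
    options = distractors + [idiom]
    rng.shuffle(options)
    # Blank out only the first occurrence of the idiom.
    context = sentence.replace(idiom, blank, 1)
    return ClozeItem(context, options, options.index(idiom))


# Illustrative usage with common Indonesian idioms (not dataset entries).
phrases = ["kambing hitam", "buah tangan", "meja hijau",
           "panjang tangan", "kepala batu"]
item = make_cloze_item(
    "Dia selalu dijadikan kambing hitam oleh teman-temannya.",
    "kambing hitam",
    phrases,
)
print(item.context)
print(item.options, item.answer_index)
```

A model for this task would score each option against the blanked context and pick the highest-scoring one; how the options are embedded (static, contextual, or definition-based) is the design question the abstract investigates.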
