CLSep 29, 2021

Analysing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets

arXiv:2110.15712v10.26 citations
Originality Incremental advance
AI Analysis

This work addresses the optimization of pre-training strategies for MRC tasks, offering incremental insights into tailoring MLM objectives for specific dataset characteristics.

The paper investigates whether the masking length distribution in masked language models (MLM) correlates with answer length in machine reading comprehension (MRC) tasks, finding that aligning MLM mask lengths with answer lengths improves performance on four newly created Chinese MRC datasets.

Machine reading comprehension (MRC) is a challenging natural language processing (NLP) task. Recently, the emergence of pre-trained models (PTM) has brought this research field into a new era, in which the training objective plays a key role. The masked language model (MLM) is a self-supervised training objective that widely used in various PTMs. With the development of training objectives, many variants of MLM have been proposed, such as whole word masking, entity masking, phrase masking, span masking, and so on. In different MLM, the length of the masked tokens is different. Similarly, in different machine reading comprehension tasks, the length of the answer is also different, and the answer is often a word, phrase, or sentence. Thus, in MRC tasks with different answer lengths, whether the length of MLM is related to performance is a question worth studying. If this hypothesis is true, it can guide us how to pre-train the MLM model with a relatively suitable mask length distribution for MRC task. In this paper, we try to uncover how much of MLM's success in the machine reading comprehension tasks comes from the correlation between masking length distribution and answer length in MRC dataset. In order to address this issue, herein, (1) we propose four MRC tasks with different answer length distributions, namely short span extraction task, long span extraction task, short multiple-choice cloze task, long multiple-choice cloze task; (2) four Chinese MRC datasets are created for these tasks; (3) we also have pre-trained four masked language models according to the answer length distributions of these datasets; (4) ablation experiments are conducted on the datasets to verify our hypothesis. The experimental results demonstrate that our hypothesis is true.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes