CL · LG · Oct 28, 2023

TLM: Token-Level Masking for Transformers

arXiv:2310.18738v1131 citations · h-index: 4 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses overfitting in Transformers, offering NLP practitioners an incremental improvement over existing regularization techniques.

The paper tackles overfitting in Transformers by proposing Token-Level Masking (TLM), a regularization strategy that masks connections between tokens in self-attention. It reports consistent improvements over attention dropout and DropHead, including a 0.5-point gain over DropHead on GLUE with BERT-large and a new record of 18.93 BLEU on Rotowire.

Structured dropout approaches, such as attention dropout and DropHead, have been investigated to regularize the multi-head attention mechanism in Transformers. In this paper, we propose a new regularization scheme that operates at the token level rather than the structure level to reduce overfitting. Specifically, we devise a novel Token-Level Masking (TLM) training strategy for Transformers to regularize the connections of self-attention. It consists of two masking techniques that are effective and easy to implement. The underlying idea is to manipulate the connections between tokens in multi-head attention via masking, so that the network is forced to use only part of its neighbors' information to produce a meaningful representation. The generality and effectiveness of TLM are evaluated through extensive experiments on four diverse NLP tasks across 18 datasets, including the natural language understanding benchmarks GLUE and ChineseGLUE, Chinese Grammatical Error Correction, and data-to-text generation. The results indicate that TLM consistently outperforms attention dropout and DropHead; for example, it improves on DropHead by 0.5 points with BERT-large on GLUE. Moreover, TLM establishes a new record on the data-to-text benchmark Rotowire (18.93 BLEU). Our code will be publicly available at https://github.com/Young1993/tlm.
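
The general idea of masking token-to-token connections in self-attention can be illustrated with a small, dependency-free Python sketch. This is not the authors' implementation: the abstract does not describe the two masking techniques, so the rule used here (drop each connection independently with probability mask_rate, always keep a token's connection to itself) is an assumption made purely for illustration. The names token_level_mask, masked_softmax, and mask_rate are hypothetical.

```python
import math
import random


def token_level_mask(num_tokens, mask_rate, rng):
    """Return keep[i][j]: may query token i attend to key token j?

    Assumption (not from the paper): each connection is dropped
    independently with probability mask_rate.
    """
    keep = []
    for i in range(num_tokens):
        row = [rng.random() >= mask_rate for _ in range(num_tokens)]
        row[i] = True  # assumption: a token can always attend to itself
        keep.append(row)
    return keep


def masked_softmax(scores, keep):
    """Row-wise softmax over attention scores, skipping masked connections."""
    weights = []
    for score_row, keep_row in zip(scores, keep):
        # Subtract the largest kept score for numerical stability.
        peak = max(s for s, k in zip(score_row, keep_row) if k)
        exps = [math.exp(s - peak) if k else 0.0
                for s, k in zip(score_row, keep_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy attention scores for a 4-token sequence (random stand-ins for QK^T).
rng = random.Random(0)
scores = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(4)]
keep = token_level_mask(4, mask_rate=0.3, rng=rng)
weights = masked_softmax(scores, keep)
```

Each row of weights still sums to 1, but masked connections receive exactly zero weight, so the affected token has to build its representation from the neighbors that remain visible. As with other dropout-style regularizers, a mask of this kind would presumably be applied during training only.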

Code Implementations: 1 repo
