CL LG · Dec 30, 2020

Optimizing Deeper Transformers on Small Datasets

Peng Xu, Dhruv Kumar, Wei Yang, Wenjie Zi, Keyi Tang, Chenyang Huang, Jackie Chi Kit Cheung, Simon J. D. Prince, Yanshuai Cao

arXiv:2012.15355v4 · 28.1729 citations


AI Analysis

This work matters to researchers and practitioners applying deep learning to small datasets: it shows that deeper transformer architectures can be trained for challenging tasks such as Text-to-SQL parsing, and that the added depth may improve generalization on cases that require complex reasoning.

This paper challenges the belief that training deep transformers requires large datasets, showing that with proper initialization and optimization, very deep transformers can be trained effectively on small datasets. The authors train a 48-layer transformer and achieve state-of-the-art performance on the cross-domain Text-to-SQL parsing benchmark Spider, with fewer training steps and no task-specific pre-training.

It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, people usually add only shallow, simple layers on top of pre-trained models during fine-tuning. This work shows that this does not always need to be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train $48$ transformer layers, comprising $24$ fine-tuned layers from pre-trained RoBERTa and $24$ relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain state-of-the-art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work. Further error analysis shows that increasing depth can help improve generalization on small datasets for hard cases that require reasoning and structural understanding.
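The "data-dependent" part of the initialization can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, and the specific factor $(N(4\mu^2 + 2\mu + 2))^{-1/2}$ is an assumption about the form of the rescaling that should be checked against the paper. The sketch only shows the general pattern: estimate the largest input norm $\mu$ from data, then shrink the freshly initialized weights of the $N$ from-scratch layers by a factor that depends on both $N$ and $\mu$.

```python
import math


def max_input_norm(inputs):
    """Return the largest L2 norm among the input vectors (the statistic mu)."""
    return max(math.sqrt(sum(x * x for x in vec)) for vec in inputs)


def dt_fixup_style_scale(num_layers, mu):
    """Rescaling factor for newly initialized weights.

    ASSUMPTION: the factor has the form (N * (4*mu^2 + 2*mu + 2)) ** -0.5.
    Verify against the paper before relying on it.
    """
    return (num_layers * (4 * mu ** 2 + 2 * mu + 2)) ** -0.5


def scale_matrix(matrix, factor):
    """Multiply every entry of a weight matrix (list of rows) by factor."""
    return [[w * factor for w in row] for row in matrix]


# Toy stand-in for the inputs reaching the from-scratch layers
# (in the paper's setting these would come from the pre-trained encoder).
inputs = [[3.0, 4.0], [1.0, 0.0]]
mu = max_input_norm(inputs)            # largest input norm in the toy data

# 24 layers trained from scratch, as in the abstract.
factor = dt_fixup_style_scale(24, mu)

# Toy weight matrix standing in for one freshly initialized projection.
weights = [[1.0, -2.0], [0.5, 0.0]]
scaled = scale_matrix(weights, factor)
```

Under this assumed form, deeper stacks and larger input norms both lead to smaller initial weights, which is the intuition for why the update magnitude stays bounded as depth grows.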

