Not all layers are equally as important: Every Layer Counts BERT
The work addresses the need for more efficient training in natural language processing, particularly in resource-constrained settings. Its contribution is incremental in that it builds on existing transformer architectures.
The paper tackles data-efficient pretraining of language models by introducing a transformer modification that lets each layer select which outputs of previous layers to process. The resulting model won both the strict and strict-small tracks of the BabyLM challenge.
This paper introduces a novel modification of the transformer architecture, tailored for the data-efficient pretraining of language models. This aspect is evaluated by participating in the BabyLM challenge, where our solution won both the strict and strict-small tracks. Our approach allows each transformer layer to select which outputs of previous layers to process. The empirical results verify the potential of this simple modification and show that not all layers are equally as important.
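The layer-selection idea in the abstract can be illustrated with a toy sketch. The text above does not say how the selection is parameterized, so everything below is an assumption made for illustration: each layer holds one raw score per earlier output, the scores are softmax-normalized, and the layer input is the resulting weighted sum. The names `forward`, `combine`, and `layer_scores` are hypothetical, and the "layers" are plain functions on vectors rather than real transformer blocks.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine(outputs: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted sum of earlier outputs, computed element by element."""
    dim = len(outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, outputs)) for d in range(dim)]


def forward(embedding: Vector,
            layers: Sequence[Callable[[Vector], Vector]],
            layer_scores: Sequence[Sequence[float]]) -> Vector:
    """Run a stack in which each layer reads a weighted mix of all earlier outputs.

    layer_scores[n] holds one raw score per earlier output. The embedding
    counts as output 0, so layer index n needs n + 1 scores.
    """
    outputs = [embedding]
    for layer, scores in zip(layers, layer_scores):
        # Assumption: softmax-normalized weights; in a trained model the
        # scores would be learned parameters, here they are fixed inputs.
        weights = softmax(scores)
        layer_input = combine(outputs, weights)
        outputs.append(layer(layer_input))
    return outputs[-1]
```

With equal scores a layer averages everything that came before it. A large score on one entry makes the layer read almost exclusively from that earlier output, which is one way to picture a layer "selecting" its inputs and some layers mattering more than others.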