CL · Oct 31, 2024

GPT or BERT: why not both?

arXiv:2410.24159v2 · 29 citations · h-index: 11
Originality: Incremental advance
AI Analysis

The paper addresses the need for more flexible language models that can handle both the masked and the causal modeling paradigm, though it appears incremental because it builds on existing transformer methods.

The authors tackled the problem of combining masked and causal language modeling by introducing a hybrid training objective, resulting in GPT-BERT which outperforms masked-only or causal-only models on the BabyLM Challenge 2024.

We present a simple way to merge masked language modeling with causal language modeling. This hybrid training objective results in a model that combines the strengths of both modeling paradigms within a single transformer stack: GPT-BERT can be transparently used like any standard causal or masked language model. We test the pretraining process that enables this flexible behavior on the BabyLM Challenge 2024. The results show that the hybrid pretraining outperforms masked-only or causal-only models. We openly release the models, training corpora and code.
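The abstract does not spell out how the two objectives are merged, so the following is only a minimal sketch of one way a hybrid objective could be set up, not the authors' implementation. It assumes each training sequence is assigned to either a causal or a masked mode, and that the two modes differ in how inputs, targets, and the attention pattern are built. All names here (`MASK_ID`, `IGNORE`, `causal_ratio`, `mask_prob`, and the three functions) are hypothetical and chosen for illustration.

```python
import random

MASK_ID = 0     # hypothetical id of the mask token
IGNORE = -100   # hypothetical label meaning "no loss at this position"


def causal_example(tokens):
    """Next-token prediction: position i may attend only to positions <= i."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    n = len(inputs)
    # attend[i][j] is True when position i is allowed to look at position j
    attend = [[j <= i for j in range(n)] for i in range(n)]
    return inputs, targets, attend


def masked_example(tokens, mask_prob, rng):
    """Masked prediction: hidden tokens are recovered from context on both sides."""
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # hide the token from the model
            targets.append(tok)      # and ask the model to recover it
        else:
            inputs.append(tok)
            targets.append(IGNORE)   # visible tokens contribute no loss
    n = len(tokens)
    attend = [[True] * n for _ in range(n)]  # fully bidirectional attention
    return inputs, targets, attend


def hybrid_example(tokens, causal_ratio=0.5, mask_prob=0.15, seed=None):
    """Pick one of the two modes at random for a single training sequence."""
    rng = random.Random(seed)
    if rng.random() < causal_ratio:
        return ("causal",) + causal_example(tokens)
    return ("masked",) + masked_example(tokens, mask_prob, rng)
```

Under these assumptions, a single transformer would see both kinds of examples during pretraining, and `causal_ratio` would control the mix. The values 0.5 and 0.15 are placeholders, not settings reported in the source.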

Code Implementations: 1 repo
