CL | Oct 31, 2024

GPT or BERT: why not both?

Lucas Georges Gabriel Charpentier, David Samuel

arXiv:2410.24159v2 | 14.929 citations | h-index: 11 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for more flexible language models that can handle both masked and causal language modeling, though it appears incremental because it builds on existing transformer methods.

The authors tackle the problem of combining masked and causal language modeling by introducing a hybrid training objective. The resulting model, GPT-BERT, outperforms masked-only and causal-only models on the BabyLM Challenge 2024.

We present a simple way to merge masked language modeling with causal language modeling. This hybrid training objective results in a model that combines the strengths of both modeling paradigms within a single transformer stack: GPT-BERT can be transparently used like any standard causal or masked language model. We test the pretraining process that enables this flexible behavior on the BabyLM Challenge 2024. The results show that the hybrid pretraining outperforms masked-only or causal-only models. We openly release the models, training corpora and code.
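As a rough illustration of what mixing the two objectives in one training stream can look like, the sketch below builds causal (next-token) and masked training examples from the same token sequences and mixes them by a ratio. It is not the authors' formulation: the per-sequence mixing rule, the `causal_ratio` and `mask_prob` defaults, the `MASK_ID` and `IGNORE` constants, and all function names are assumptions made for this example.

```python
import random

MASK_ID = 0      # hypothetical id reserved for the [MASK] token
IGNORE = -100    # hypothetical label for positions that carry no loss


def causal_example(tokens):
    """Next-token prediction: each position predicts the following token."""
    return {"mode": "causal", "inputs": tokens[:-1], "targets": tokens[1:]}


def masked_example(tokens, mask_prob, rng):
    """Masked prediction: hidden positions predict their original token."""
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # hide the token from the model
            targets.append(tok)      # and ask the model to recover it
        else:
            inputs.append(tok)
            targets.append(IGNORE)   # visible tokens contribute no loss
    return {"mode": "masked", "inputs": inputs, "targets": targets}


def hybrid_batch(sequences, causal_ratio=0.5, mask_prob=0.15, seed=0):
    """Assign each sequence to one objective (assumed mixing rule)."""
    rng = random.Random(seed)
    batch = []
    for tokens in sequences:
        if rng.random() < causal_ratio:
            batch.append(causal_example(tokens))
        else:
            batch.append(masked_example(tokens, mask_prob, rng))
    return batch
```

Under these assumptions, a single transformer would consume both kinds of examples, using a causal attention mask for `"causal"` entries and a bidirectional one for `"masked"` entries. The released code is the reference for how the paper actually combines the objectives.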
