CL Sep 10, 2021

IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization

Fajri Koto, Jey Han Lau, Timothy Baldwin

arXiv:2109.04607v1 30.9664 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficient adaptation of a pretrained language model to Indonesian Twitter text. The contribution is incremental: it builds on existing BERT methods, adding domain-specific changes to the vocabulary and to embedding initialization.

The authors address vocabulary mismatch when adapting a monolingual Indonesian BERT model to Indonesian Twitter. Their model, IndoBERTweet, extends the original vocabulary with additive domain-specific word types and initializes each new embedding with the average BERT subword embedding. The authors report that this makes pretraining five times faster and that it outperforms previously proposed vocabulary adaptation methods in extrinsic evaluation over seven Twitter-based datasets.

We present IndoBERTweet, the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually-trained Indonesian BERT model with additive domain-specific vocabulary. We focus in particular on efficient model adaptation under vocabulary mismatch, and benchmark different ways of initializing the BERT embedding layer for new word types. We find that initializing with the average BERT subword embedding makes pretraining five times faster, and is more effective than proposed methods for vocabulary adaptation in terms of extrinsic evaluation over seven Twitter-based datasets.
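The average-subword initialization described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the toy vocabulary, the embedding values, the example word types, and the helper names `toy_wordpiece` and `average_subword_init` are all assumptions made for the example. The idea shown is that each new word type is split into subwords by the original tokenizer, and its new embedding row starts as the mean of those subword embeddings.

```python
from typing import Dict, List


def toy_wordpiece(word: str, vocab: Dict[str, int]) -> List[str]:
    """Greedy longest-match subword split (a simplified stand-in for WordPiece)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            # Non-initial pieces carry the "##" continuation prefix.
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matched: fall back to the unknown token
        pieces.append(piece)
        start = end
    return pieces


def average_subword_init(
    new_tokens: List[str],
    vocab: Dict[str, int],
    embeddings: List[List[float]],
) -> List[List[float]]:
    """Return one embedding row per new token: the mean of its subword embeddings."""
    dim = len(embeddings[0])
    rows = []
    for token in new_tokens:
        ids = [vocab[p] for p in toy_wordpiece(token, vocab)]
        rows.append([sum(embeddings[i][d] for i in ids) / len(ids) for d in range(dim)])
    return rows


# Hypothetical original vocabulary and 2-dimensional embedding table.
vocab = {"[UNK]": 0, "wk": 1, "##wk": 2, "ga": 3, "##k": 4}
embeddings = [[0.0, 0.0], [1.0, 1.0], [3.0, 1.0], [2.0, 0.0], [4.0, 2.0]]

# Hypothetical Twitter-specific word types to add to the vocabulary.
new_tokens = ["wkwk", "gak"]
new_rows = average_subword_init(new_tokens, vocab, embeddings)

# The extended table keeps the original rows and appends the new ones.
extended_embeddings = embeddings + new_rows
```

In a real setup the averaged rows would be written into the resized embedding matrix of the pretrained model before continued pretraining. The alternatives the abstract benchmarks against would differ only in how `new_rows` is computed.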
