CLSep 10, 2021

IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization

arXiv:2109.04607v1664 citations
Originality Incremental advance
AI Analysis

This work addresses efficient model adaptation for Indonesian Twitter users, but it is incremental as it builds on existing BERT methods with domain-specific tweaks.

The authors tackled the problem of vocabulary mismatch in adapting a monolingual Indonesian BERT model to Indonesian Twitter by proposing IndoBERTweet, which uses additive domain-specific vocabulary with an initialization method based on average BERT subword embeddings, resulting in pretraining that is five times faster and more effective across seven Twitter datasets.

We present IndoBERTweet, the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually-trained Indonesian BERT model with additive domain-specific vocabulary. We focus in particular on efficient model adaptation under vocabulary mismatch, and benchmark different ways of initializing the BERT embedding layer for new word types. We find that initializing with the average BERT subword embedding makes pretraining five times faster, and is more effective than proposed methods for vocabulary adaptation in terms of extrinsic evaluation over seven Twitter-based datasets.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes