CL · AI · Nov 21, 2023

PhayaThaiBERT: Enhancing a Pretrained Thai Language Model with Unassimilated Loanwords

arXiv:2311.12475v2 · 10 citations · h-index: 5
Originality: Synthesis-oriented
AI Analysis

This work addresses a specific limitation in Thai language modeling for NLP applications, but it is incremental as it builds directly on an existing model.

The authors tackled the problem of WangchanBERTa's poor understanding of unassimilated English loanwords in Thai by expanding its vocabulary with XLM-R's tokenizer and pretraining on a larger dataset, resulting in PhayaThaiBERT, which outperforms WangchanBERTa in many downstream tasks.

While WangchanBERTa has become the de facto standard in transformer-based Thai language modeling, it still has shortcomings in regard to the understanding of foreign words, most notably English words, which are often borrowed without orthographic assimilation into Thai in many contexts. We identify the lack of foreign vocabulary in WangchanBERTa's tokenizer as the main source of these shortcomings. We then expand WangchanBERTa's vocabulary via vocabulary transfer from XLM-R's pretrained tokenizer and pretrain a new model using the expanded tokenizer, starting from WangchanBERTa's checkpoint, on a new dataset that is larger than the one used to train WangchanBERTa. Our results show that our new pretrained model, PhayaThaiBERT, outperforms WangchanBERTa in many downstream tasks and datasets.
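The vocabulary expansion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vocabularies, the function names `expand_vocab` and `expand_embeddings`, and the choice to initialize new embedding rows with the mean of the existing rows are all assumptions made for illustration. The abstract does not say how the new embeddings are initialized. The sketch also assumes the base vocabulary uses contiguous ids starting at 0.

```python
def expand_vocab(base_vocab, donor_tokens):
    """Append donor tokens that the base vocabulary lacks.

    base_vocab: dict mapping token -> id, ids assumed contiguous from 0.
    donor_tokens: iterable of tokens from the donor tokenizer.
    Returns the expanded vocabulary and the list of newly added tokens.
    """
    expanded = dict(base_vocab)
    added = []
    for token in donor_tokens:
        if token not in expanded:
            expanded[token] = len(expanded)  # next free id
            added.append(token)
    return expanded, added


def expand_embeddings(embeddings, num_new):
    """Keep the existing rows and append num_new rows.

    New rows are set to the mean of the existing rows. This is an
    illustrative assumption, not a method taken from the paper.
    """
    dim = len(embeddings[0])
    mean = [sum(row[d] for row in embeddings) / len(embeddings) for d in range(dim)]
    return [list(row) for row in embeddings] + [list(mean) for _ in range(num_new)]


# Hypothetical toy data: a Thai-only base vocabulary and a donor
# vocabulary that also contains English pieces.
base_vocab = {"<unk>": 0, "▁สวัสดี": 1, "▁ครับ": 2}
donor_tokens = ["▁ครับ", "▁hello", "▁world"]
base_embeddings = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]

new_vocab, added = expand_vocab(base_vocab, donor_tokens)
new_embeddings = expand_embeddings(base_embeddings, len(added))
```

Because the original rows and ids are left untouched, weights from the existing checkpoint stay aligned with the tokens they were trained on, and only the appended rows need to be learned during continued pretraining.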

Code Implementations: 1 repo