RoBERTurk: Adjusting RoBERTa for Turkish
This work presents a Turkish RoBERTa model that is pretrained on less data than comparable models and shows mixed results against the BERTurk baselines.
The researchers adapted RoBERTa for Turkish by pretraining it on Turkish corpora with a BPE tokenizer. The resulting model outperforms the BERTurk models on the BOUN dataset for POS tagging, underperforms them on the IMST dataset for the same task, and is competitive on the XTREME dataset for NER despite using less pretraining data.
We pretrain RoBERTa on Turkish corpora using a BPE tokenizer. Our model outperforms the BERTurk family of models on the BOUN dataset for POS tagging, underperforms them on the IMST dataset for the same task, and achieves competitive scores on the Turkish split of the XTREME dataset for NER, all while being pretrained on less data than its competitors. We release our pretrained model and tokenizer.
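For readers unfamiliar with BPE, the following is a minimal, illustrative sketch of how merge rules are learned and applied. It is a toy character-level version written for this summary, not the released tokenizer: RoBERTa-style tokenizers typically operate on bytes, and the corpus, the number of merges, and the function names (`learn_bpe`, `merge_pair`, `encode`) are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs over the whole corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Take the most frequent pair; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical mini-corpus of Turkish word forms sharing stems and suffixes.
corpus = ["evler", "evlerde", "evlerden", "kitaplar", "kitaplarda"]
merges = learn_bpe(corpus, num_merges=4)
print(merges)
print(encode("evlerden", merges))
```

Subword merging of this kind is one reason BPE is a common choice for agglutinative languages such as Turkish, where frequent stems and suffixes can become single tokens; the vocabulary size and training corpus of the actual tokenizer are not specified in this abstract.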