Contextual Text Embeddings for Twi
This work addresses the scarcity of NLP resources for low-resource languages such as Twi, which benefits researchers and developers in Ghana and similar contexts. The contribution is incremental, however, since it applies existing methods to a new language.
The paper tackles the lack of transformer-based language models for Ghanaian languages by developing the first such models for Twi (Akan), covering the Akuapem and Asante dialects. The resulting open-source models, ABENA and BAKO, pave the way for applications such as Named Entity Recognition and Sentiment Analysis.
Transformer-based language models have been changing the modern Natural Language Processing (NLP) landscape for high-resource languages such as English, Chinese and Russian. However, this technology does not yet exist for any Ghanaian language. In this paper, we introduce the first such models for Twi, or Akan, the most widely spoken Ghanaian language. The specific contribution of this research work is the development of several pretrained transformer language models for the Akuapem and Asante dialects of Twi, paving the way for advances in application areas such as Named Entity Recognition (NER), Neural Machine Translation (NMT), Sentiment Analysis (SA) and Part-of-Speech (POS) tagging. Specifically, we introduce four different flavours of ABENA (A BERT model Now in Akan), which is fine-tuned on a set of Akan corpora, and BAKO (BERT with Akan Knowledge only), which is trained from scratch. We open-source the models through the Hugging Face model hub and demonstrate their use via a simple sentiment classification example.
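As a rough illustration of how a model published on the Hugging Face model hub might be used to obtain contextual embeddings, the sketch below mean-pools the per-token vectors of a BERT-style model into a single sentence vector, which could then feed a downstream classifier such as a sentiment model. This is not the authors' code. The model identifier is an assumption and should be replaced with the ID actually published by the authors, and the helper names `mean_pool` and `embed` are hypothetical. The sketch assumes the `transformers` library with a PyTorch backend is installed.

```python
from statistics import fmean

# Assumed hub identifier, not taken from the abstract; replace it with the
# model ID actually published by the authors.
MODEL_ID = "Ghana-NLP/abena-base-asante-twi-uncased"


def mean_pool(token_vectors):
    """Average per-token vectors (one list of floats per token)
    into a single fixed-size sentence vector."""
    return [fmean(dimension) for dimension in zip(*token_vectors)]


def embed(text, model_id=MODEL_ID):
    """Return a mean-pooled contextual embedding for `text`.

    The import is kept local so that this module can be loaded without
    `transformers` installed; calling this function downloads the model.
    """
    from transformers import pipeline

    extractor = pipeline("feature-extraction", model=model_id)
    # For a single input string the pipeline returns a nested list of shape
    # [1][num_tokens][hidden_size]; index 0 selects the only sequence.
    token_vectors = extractor(text)[0]
    return mean_pool(token_vectors)
```

Calling `embed` on a Twi sentence would yield one vector per sentence. Mean pooling is only one common choice here; using the vector of the first (classification) token is an equally reasonable alternative.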