Bernard Opoku

CL
8papers
316citations
Novelty21%
AI Score36

8 Papers

CLFeb 17, 2023
AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages

Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele et al.

Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. These include 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families. The tweets were annotated by native speakers and used in the AfriSenti-SemEval shared task (The AfriSenti Shared Task had over 200 participants. See website at https://afrisenti-semeval.github.io). We describe the data collection methodology, annotation process, and the challenges we dealt with when curating each dataset. We further report baseline experiments conducted on the different datasets and discuss their usefulness.

ASJul 7, 2022
BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus

Josh Meyer, David Ifeoluwa Adelani, Edresson Casanova et al.

BibleTTS is a large, high-quality, open speech dataset for ten languages spoken in Sub-Saharan Africa. The corpus contains up to 86 hours of aligned, studio quality 48kHz single speaker recordings per language, enabling the development of high-quality text-to-speech models. The ten languages represented are: Akuapem Twi, Asante Twi, Chichewa, Ewe, Hausa, Kikuyu, Lingala, Luganda, Luo, and Yoruba. This corpus is a derivative work of Bible recordings made and released by the Open.Bible project from Biblica. We have aligned, cleaned, and filtered the original recordings, and additionally hand-checked a subset of the alignments for each language. We present results for text-to-speech models with Coqui TTS. The data is released under a commercial-friendly CC-BY-SA license.

CLNov 16, 2023
AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

Jiayi Wang, David Ifeoluwa Adelani, Sweta Agrawal et al.

Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).

CLMar 29, 2021Code
Contextual Text Embeddings for Twi

Paul Azunre, Salomey Osei, Salomey Addo et al.

Transformer-based language models have been changing the modern Natural Language Processing (NLP) landscape for high-resource languages such as English, Chinese, Russian, etc. However, this technology does not yet exist for any Ghanaian language. In this paper, we introduce the first of such models for Twi or Akan, the most widely spoken Ghanaian language. The specific contribution of this research work is the development of several pretrained transformer language models for the Akuapem and Asante dialects of Twi, paving the way for advances in application areas such as Named Entity Recognition (NER), Neural Machine Translation (NMT), Sentiment Analysis (SA) and Part-of-Speech (POS) tagging. Specifically, we introduce four different flavours of ABENA -- A BERT model Now in Akan that is fine-tuned on a set of Akan corpora, and BAKO - BERT with Akan Knowledge only, which is trained from scratch. We open-source the model through the Hugging Face model hub and demonstrate its use via a simple sentiment classification example.

CLMar 29, 2021Code
NLP for Ghanaian Languages

Paul Azunre, Salomey Osei, Salomey Addo et al.

NLP Ghana is an open-source non-profit organization aiming to advance the development and adoption of state-of-the-art NLP techniques and digital language tools to Ghanaian languages and problems. In this paper, we first present the motivation and necessity for the efforts of the organization; by introducing some popular Ghanaian languages while presenting the state of NLP in Ghana. We then present the NLP Ghana organization and outline its aims, scope of work, some of the methods employed and contributions made thus far in the NLP community in Ghana.

40.2HCMar 22
A Parametric, Geometry-Aware Residential Construction Cost Estimation Model for Ghana: Design, Validation, and the "Completeness Gap" in Informal Contractor Quotes

Emmanuel Apaaboah, Bernard Opoku, the GhanaHousePlanner Research Team

Ghana faces a residential housing deficit of two million units. A key driver of project failure is the "completeness gap", a systematic discrepancy between informal contractor quotes and actual costs. Informal estimates often use flat per-square-metre pricing that omits essential structural and finishing components, leading to project abandonment mid-construction. This paper validates a parametric, geometry-aware cost estimation model via the GhanaHousePlanner (GHP) platform. The model provides self-builders with itemised bills of quantities (BoQ) reflecting the true cost of code-compliant construction in Ghana. The GHP model uses seven calculation modules: foundation, blockwork, cement, structural steel, roofing, plumbing, and electrical. It features a primary geometry-based mode and a formula-based fallback. Accuracy was tested using three case studies (75, 120, and 200 per-square-metre homes) benchmarked against February 2026 market prices in Greater Accra.GHP estimates (GHS 519,000 to GHS 1,398,000) were 29 to 98 per cent higher than typical informal quotes. This gap arises from the omission of structural steel (Y16 rebar), plastering, floor screed, and full services in informal estimates. Findings confirm that per-square-metre rates rarely cover the requirements for a fully completed, code-compliant building. The GHP model offers a transparent, auditable alternative to informal quoting. Despite material price volatility and labour market informality, the tool provides a framework for improving cost predictability and reducing project stalling in the sub-Saharan African housing market.

CLMay 11, 2023
AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

Odunayo Ogundepo, Tajuddeen R. Gwadabe, Clara E. Rivera et al.

African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.

CLMar 29, 2021
English-Twi Parallel Corpus for Machine Translation

Paul Azunre, Salomey Osei, Salomey Addo et al.

We present a parallel machine translation training corpus for English and Akuapem Twi of 25,421 sentence pairs. We used a transformer-based translator to generate initial translations in Akuapem Twi, which were later verified and corrected where necessary by native speakers to eliminate any occurrence of translationese. In addition, 697 higher quality crowd-sourced sentences are provided for use as an evaluation set for downstream Natural Language Processing (NLP) tasks. The typical use case for the larger human-verified dataset is for further training of machine translation models in Akuapem Twi. The higher quality 697 crowd-sourced dataset is recommended as a testing dataset for machine translation of English to Twi and Twi to English models. Furthermore, the Twi part of the crowd-sourced data may also be used for other tasks, such as representation learning, classification, etc. We fine-tune the transformer translation model on the training corpus and report benchmarks on the crowd-sourced test set.