Barack Wanjawa

h-index: 4

7 papers

125 citations

Novelty: 20%

AI Score: 23

Ranked #174,974 of 194,257 authors (top 90%); #29,033 in CL (top 94%)

7 Papers

1.1 | CL | Oct 29, 2022 | Code

Phonemic Representation and Transcription for Speech to Text Applications for Under-resourced Indigenous African Languages: The Case of Kiswahili

Ebbie Awino, Lilian Wanzare, Lawrence Muchemi et al.

Building automatic speech recognition (ASR) systems is a challenging task, especially for under-resourced languages that lack sufficient training data and need corpora constructed nearly from scratch. Several African indigenous languages, including Kiswahili, are technologically under-resourced. ASR systems are crucial, particularly for hearing-impaired persons, who can benefit from having transcripts in their native languages. However, the absence of transcribed speech datasets has complicated efforts to develop ASR models for these indigenous languages. This paper explores the transcription process and the development of a Kiswahili speech corpus, which includes both read-out texts and spontaneous speech data from native Kiswahili speakers. The study also discusses the vowels and consonants in Kiswahili and provides an updated Kiswahili phoneme dictionary for the ASR model, which was created using CMU Sphinx, an open-source speech recognition toolkit. The ASR model was trained using an extended phonetic set and yielded a word error rate (WER) of 18.87% and a sentence error rate (SER) of 49.5%, an improvement over previous similar research for under-resourced languages.
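
For readers unfamiliar with the two metrics quoted in this abstract, the following is a minimal sketch of how WER and SER are commonly computed from reference and hypothesis transcripts. It is not the paper's evaluation script; the function names `edit_distance` and `wer_and_ser` and the sample Kiswahili sentences are illustrative assumptions, and SER is taken here as the fraction of sentences containing at least one word error.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_and_ser(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (WER, SER) as fractions over (reference, hypothesis) pairs."""
    total_errors = 0
    total_ref_words = 0
    wrong_sentences = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        errors = edit_distance(ref_words, hypothesis.split())
        total_errors += errors
        total_ref_words += len(ref_words)
        if errors > 0:
            wrong_sentences += 1
    return total_errors / total_ref_words, wrong_sentences / len(pairs)


# Hypothetical transcripts, for illustration only.
sample_pairs = [
    ("habari ya asubuhi", "habari ya asubuhi"),
    ("ninapenda kusoma vitabu", "ninapenda kusoma kitabu"),
]
wer, ser = wer_and_ser(sample_pairs)
```

Because a single wrong word marks the whole sentence as wrong, SER is typically much higher than WER, which is consistent with the gap between the two figures reported above.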

6.9 | CL | Aug 25, 2022

Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks

Barack Wanjawa, Lilian Wanzare, Florence Indede et al.

Indigenous African languages are categorized as under-served in Natural Language Processing. They therefore experience poor digital inclusivity and information access. The processing challenge with such languages has been how to use machine learning and deep learning models without the requisite data. The Kencorpus project intends to bridge this gap by collecting and storing text and speech data that is good enough for data-driven solutions in applications such as machine translation, question answering and transcription in multilingual communities. The Kencorpus dataset is a text and speech corpus for three languages predominantly spoken in Kenya: Swahili, Dholuo and Luhya. Data collection was done by researchers from communities, schools, media, and publishers. The Kencorpus dataset has a collection of 5,594 items: 4,442 texts (5.6M words) and 1,152 speech files (177 hours). Based on this data, Part of Speech tagging sets for Dholuo and Luhya (50,000 and 93,000 words respectively) were developed. We developed 7,537 Question-Answer pairs for Swahili and created a text translation set of 13,400 sentences from Dholuo and Luhya into Swahili. The datasets are useful for downstream machine learning tasks such as model training and translation. We also developed two proof of concept systems: one for Kiswahili speech-to-text and a machine learning system for the Question Answering task, with results of 18.87% word error rate and 80% Exact Match (EM) respectively. These initial results show promise for the usability of Kencorpus to the machine learning community. Kencorpus is one of the few public domain corpora for these three low resource languages and forms a basis for learning and sharing experiences for similar works, especially for low resource languages.
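
The Exact Match (EM) score mentioned above is usually computed by comparing normalized predicted and gold answers. The sketch below assumes a simple normalization (lowercasing, punctuation stripping, whitespace collapsing); the abstract does not specify the normalization actually used, and the helper names and sample answers are hypothetical.

```python
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(normalize_answer(p) == normalize_answer(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical answers, for illustration only.
em = exact_match(["Nairobi", "mwalimu mkuu", "Kisumu"],
                 ["nairobi.", "Mwalimu  Mkuu", "Mombasa"])
```

EM is a strict metric: a prediction that is only partially correct scores zero, so it tends to understate the quality of near-miss answers.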

2.6 | CL | May 4, 2022

KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language

Barack W. Wanjawa, Lilian D. A. Wanzare, Florence Indede et al.

The need for Question Answering datasets in low resource languages is the motivation of this research, leading to the development of the Kencorpus Swahili Question Answering Dataset, KenSwQuAD. This dataset is annotated from raw story texts in Swahili, a low resource language predominantly spoken in Eastern Africa and in other parts of the world. Question Answering (QA) datasets are important for machine comprehension of natural language for tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold standard Question Answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 texts from the total 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to the QA task confirmed that the dataset is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language.

2.7 | CL | Jan 16, 2025

Algorithm for Semantic Network Generation from Texts of Low Resource Languages Such as Kiswahili

Barack Wamkaya Wanjawa, Lawrence Muchemi, Evans Miriti

Processing low-resource languages, such as Kiswahili, using machine learning is difficult due to the lack of adequate training data. However, such low-resource languages are still important for human communication and are already in daily use. Their users need practical machine processing tasks such as summarization, disambiguation and even question answering (QA). One method of processing such languages, while bypassing the need for training data, is the use of semantic networks. Some low resource languages, such as Kiswahili, have a subject-verb-object (SVO) structure, and semantic networks are similarly built from subject-predicate-object triples, hence SVO parts of speech tags can map into a semantic network triple. An algorithm to process raw natural language text and map it into a semantic network is therefore necessary and desirable in structuring low resource language texts. This algorithm was tested on the Kiswahili QA task, with up to 78.6% exact match.
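
The SVO-to-triple mapping described in this abstract can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes the text is already part-of-speech tagged with a hypothetical two-tag scheme ("N" for noun, "V" for verb), takes the first verb as the predicate, the nearest preceding noun as the subject, and the first following noun as the object.

```python
from typing import List, Optional, Tuple

Tagged = Tuple[str, str]        # (word, tag); the tags "N"/"V" are assumed
Triple = Tuple[str, str, str]   # (subject, predicate, object)


def sentence_to_triple(tagged: List[Tagged]) -> Optional[Triple]:
    """Map one POS-tagged SVO sentence to a triple, or None if incomplete."""
    verb_idx = next((i for i, (_, tag) in enumerate(tagged) if tag == "V"), None)
    if verb_idx is None:
        return None
    # Nearest noun before the verb is the subject.
    subject = next((w for w, tag in reversed(tagged[:verb_idx]) if tag == "N"), None)
    # First noun after the verb is the object.
    obj = next((w for w, tag in tagged[verb_idx + 1:] if tag == "N"), None)
    if subject is None or obj is None:
        return None
    return (subject, tagged[verb_idx][0], obj)


def query(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return triples matching every field that is not None (None = wildcard)."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]


# Hypothetical tagged sentences, for illustration only.
sentences = [
    [("Juma", "N"), ("anasoma", "V"), ("kitabu", "N")],
    [("Amina", "N"), ("anapika", "V"), ("chakula", "N")],
]
network = [t for t in map(sentence_to_triple, sentences) if t is not None]
who_reads_the_book = query(network, predicate="anasoma", obj="kitabu")
```

A question such as "who reads the book?" then reduces to a lookup with the subject left as a wildcard, which is how a triple store can support QA without any training data.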

1.2 | ST | Dec 5, 2016

Evaluating the Performance of ANN Prediction System at Shanghai Stock Market in the Period 21-Sep-2016 to 11-Oct-2016

Barack Wamkaya Wanjawa

This research evaluates the performance of an Artificial Neural Network based prediction system that was employed on the Shanghai Stock Exchange for the period 21-Sep-2016 to 11-Oct-2016. It is a follow-up to a previous paper in which the prices were predicted and published before September 21. Stock market price prediction remains an important quest for investors and researchers. This research used an Artificial Intelligence system, namely an Artificial Neural Network in the form of a feedforward multi-layer perceptron with error backpropagation, for prediction, unlike other methods such as technical, fundamental or time series analysis. While these alternative methods tend to guide on trends and not the exact likely prices, neural networks have the ability to predict the real value prices, as was done in this research. Nonetheless, determination of suitable network parameters remains a challenge in neural network design, with this research settling on a configuration of 5:21:21:1 with 80% training data, or 4 years of training data, as a good enough model for stock prediction, as already determined in previous research by the author. The comparative results indicate that neural networks can predict typical stock market prices with mean absolute percentage errors that are as low as 1.95% over the ten prediction instances that were studied in this research.
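
The mean absolute percentage error (MAPE) used to score these predictions follows the standard formula: the average of |actual - predicted| / |actual|, expressed as a percentage. A minimal sketch, with a hypothetical function name and made-up prices rather than data from the paper:

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Made-up closing prices, for illustration only.
error_pct = mape([100.0, 200.0], [98.0, 204.0])
```

Each term is scaled by the actual price, so MAPE allows errors on stocks with very different price levels to be compared on one scale.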

1.9 | LG | Sep 17, 2016

Predicting Future Shanghai Stock Market Price using ANN in the Period 21-Sep-2016 to 11-Oct-2016

Barack Wamkaya Wanjawa

Predicting the prices of stocks at any stock market remains a quest for many investors and researchers. Those who trade at the stock market tend to use technical, fundamental or time series analysis in their predictions. These methods usually guide on trends and not the exact likely prices. It is for this reason that Artificial Intelligence systems, such as an Artificial Neural Network in the form of a feedforward multi-layer perceptron with error backpropagation, can be used for such predictions. A difficulty in neural network application is the determination of suitable network parameters. Previous research by the author determined the network parameters as 5:21:21:1 with 80% training data, or 4 years of training data, as a good enough model for stock prediction. This model has been put to the test in predicting selected Shanghai Stock Exchange stocks in the future period of 21-Sep-2016 to 11-Oct-2016, about one week after the publication of these predictions. The research aims at confirming that simple neural network systems can be quite powerful in typical stock market predictions.

8.6 | ST | Dec 17, 2014

ANN Model to Predict Stock Prices at Stock Exchange Markets

B. W. Wanjawa, L. Muchemi

Stock exchanges are considered major players in the financial sectors of many countries. Most stockbrokers, who execute stock trades, use technical, fundamental or time series analysis in trying to predict stock prices, so as to advise clients. However, these strategies do not usually guarantee good returns because they guide on trends and not the most likely price. It is therefore necessary to explore improved methods of prediction. The research proposes the use of an Artificial Neural Network in the form of a feedforward multi-layer perceptron with error backpropagation, and develops a model of configuration 5:21:21:1 with 80% training data in 130,000 cycles. The research develops a prototype and tests it on 2008-2012 data from stock markets such as the Nairobi Securities Exchange and the New York Stock Exchange, where prediction results show MAPE of between 0.71% and 2.77%. Validation done with Encog and Neuroph realized comparable results. The model is thus capable of prediction on typical stock markets.
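
To make the 5:21:21:1 configuration concrete, the sketch below builds a feedforward multi-layer perceptron with 5 inputs, two hidden layers of 21 neurons, and 1 output, and runs a single forward pass. It is an illustration only: the sigmoid activation, the weight initialization range, and the five scaled input values are assumptions not stated in the abstract, and backpropagation training is omitted.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)
LAYER_SIZES = [5, 21, 21, 1]                   # the 5:21:21:1 configuration


def init_network(sizes: Sequence[int], seed: int = 0) -> List[Layer]:
    """Create one (weights, biases) pair per connection between layers."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(layers: List[Layer], inputs: Sequence[float]) -> List[float]:
    """Propagate inputs through every layer using sigmoid activations."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = [sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
                       for row, b in zip(weights, biases)]
    return activations


def count_parameters(layers: List[Layer]) -> int:
    """Total number of trainable weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


network = init_network(LAYER_SIZES)
# Five hypothetical inputs, already scaled to the range [0, 1].
prediction = forward(network, [0.42, 0.45, 0.40, 0.44, 0.43])
n_params = count_parameters(network)
```

With this layout the network has 610 trainable parameters (126 + 462 + 22), which is small enough to train on a few years of daily prices; the sigmoid output lies in (0, 1) and would need to be rescaled back to a price.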