Bashir Shehu Galadanci

h-index4

5papers

615citations

Novelty31%

AI Score25

Ranked #165,734 of 194,257 authors (top 85%)#28,143 in CL (top 91%)

5 Papers

31.2CLMay 2, 2022

Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation

Idris Abdulmumin, Satya Ranjan Dash, Musa Abdullahi Dawud et al.

Multi-modal Machine Translation (MMT) enables the use of visual information to enhance the quality of translations. The visual information can serve as a valuable piece of context information to decrease the ambiguity of input sentences. Despite the increasing popularity of such a technique, good and sizeable datasets are scarce, limiting the full extent of their potential. Hausa, a Chadic language, is a member of the Afro-Asiatic language family. It is estimated that about 100 to 150 million people speak the language, with more than 80 million indigenous speakers. This is more than any of the other Chadic languages. Despite a large number of speakers, the Hausa language is considered low-resource in natural language processing (NLP). This is due to the absence of sufficient resources to implement most NLP tasks. While some datasets exist, they are either scarce, machine-generated, or in the religious domain. Therefore, there is a need to create training and evaluation data for implementing machine learning tasks and bridging the research gap in the language. This work presents the Hausa Visual Genome (HaVG), a dataset that contains the description of an image or a section within the image in Hausa and its equivalent in English. To prepare the dataset, we started by translating the English description of the images in the Hindi Visual Genome (HVG) into Hausa automatically. Afterward, the synthetic Hausa data was carefully post-edited considering the respective images. The dataset comprises 32,923 images and their descriptions that are divided into training, development, test, and challenge test set. The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, and image description, among various other natural language processing and generation tasks.

1.9CLOct 17, 2024

Quantity vs. Quality of Monolingual Source Data in Automatic Text Translation: Can It Be Too Little If It Is Too Good?

Idris Abdulmumin, Bashir Shehu Galadanci, Garba Aliyu et al.

Monolingual data, being readily available in large quantities, has been used to upscale the scarcely available parallel data to train better models for automatic translation. Self-learning, where a model is made to learn from its output, is one approach to exploit such data. However, it has been shown that too much of this data can be detrimental to the performance of the model if the available parallel data is comparatively extremely low. In this study, we investigate whether the monolingual data can also be too little and if this reduction, based on quality, has any effect on the performance of the translation model. Experiments have shown that on English-German low-resource NMT, it is often better to select only the most useful additional data, based on quality or closeness to the domain of the test data, than utilizing all of the available data.

0.7CLNov 14, 2020

A Hybrid Approach for Improved Low Resource Neural Machine Translation using Monolingual Data

Idris Abdulmumin, Bashir Shehu Galadanci, Abubakar Isa et al.

Many language pairs are low resource, meaning the amount and/or quality of available parallel data is not sufficient to train a neural machine translation (NMT) model which can reach an acceptable standard of accuracy. Many works have explored using the readily available monolingual data in either or both of the languages to improve the standard of translation models in low, and even high, resource languages. One of the most successful of such works is the back-translation that utilizes the translations of the target language monolingual data to increase the amount of the training data. The quality of the backward model which is trained on the available parallel data has been shown to determine the performance of the back-translation approach. Despite this, only the forward model is improved on the monolingual target data in standard back-translation. A previous study proposed an iterative back-translation approach for improving both models over several iterations. But unlike in the traditional back-translation, it relied on both the target and source monolingual data. This work, therefore, proposes a novel approach that enables both the backward and forward models to benefit from the monolingual target data through a hybrid of self-learning and back-translation respectively. Experimental results have shown the superiority of the proposed approach over the traditional back-translation method on English-German low resource neural machine translation. We also proposed an iterative self-learning approach that outperforms the iterative back-translation while also relying only on the monolingual target data and require the training of less models.

1.2CLDec 22, 2019

Tag-less Back-Translation

Idris Abdulmumin, Bashir Shehu Galadanci, Aliyu Garba

An effective method to generate a large number of parallel sentences for training improved neural machine translation (NMT) systems is the use of the back-translations of the target-side monolingual data. The standard back-translation method has been shown to be unable to efficiently utilize the available huge amount of existing monolingual data because of the inability of translation models to differentiate between the authentic and synthetic parallel data during training. Tagging, or using gates, has been used to enable translation models to distinguish between synthetic and authentic data, improving standard back-translation and also enabling the use of iterative back-translation on language pairs that underperformed using standard back-translation. In this work, we approach back-translation as a domain adaptation problem, eliminating the need for explicit tagging. In the approach -- \emph{tag-less back-translation} -- the synthetic and authentic parallel data are treated as out-of-domain and in-domain data respectively and, through pre-training and fine-tuning, the translation model is shown to be able to learn more efficiently from them during training. Experimental results have shown that the approach outperforms the standard and tagged back-translation approaches on low resource English-Vietnamese and English-German neural machine translation.

0.2CLNov 26, 2019

Iterative Batch Back-Translation for Neural Machine Translation: A Conceptual Model

Idris Abdulmumin, Bashir Shehu Galadanci, Abubakar Isa

An effective method to generate a large number of parallel sentences for training improved neural machine translation (NMT) systems is the use of back-translations of the target-side monolingual data. Recently, iterative back-translation has been shown to outperform standard back-translation albeit on some language pairs. This work proposes the iterative batch back-translation that is aimed at enhancing the standard iterative back-translation and enabling the efficient utilization of more monolingual data. After each iteration, improved back-translations of new sentences are added to the parallel data that will be used to train the final forward model. The work presents a conceptual model of the proposed approach.