CL · Nov 26, 2021 · Code
KazNERD: Kazakh Named Entity Recognition Dataset
Rustem Yeshpanov, Yerbolat Khassanov, Huseyin Atakan Varol
We present the development of a dataset for Kazakh named entity recognition. The dataset was built because there is a clear need for publicly available annotated corpora in Kazakh, as well as annotation guidelines containing straightforward but rigorous rules and examples. The dataset annotation, based on the IOB2 scheme, was carried out on television news text by two native Kazakh speakers under the supervision of the first author. The resulting dataset contains 112,702 sentences and 136,333 annotations for 25 entity classes. State-of-the-art machine learning models to automatise Kazakh named entity recognition were also built, with the best-performing model achieving an exact match F1-score of 97.22% on the test set. The annotated dataset, guidelines, and code used to train the models are freely available for download under the CC BY 4.0 licence from https://github.com/IS2AI/KazNERD.
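As an illustration of the two technical terms in this abstract, the sketch below extracts entity spans from an IOB2 tag sequence and scores a prediction with span-level exact-match F1. It is a minimal sketch, not the authors' evaluation code: the function names, the handling of stray `I-` tags, and the example labels (`PERSON`, `GPE`) are assumptions made for illustration.

```python
def iob2_spans(tags):
    """Return the set of (start, end, label) entity spans in an IOB2 tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        # An open span ends at "O", at any "B-", or at an "I-" with a different label.
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != label):
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # Assumption: an "I-" tag with no open span is ignored rather than
        # treated as the start of a new entity.
    return spans


def exact_match_f1(gold_tags, pred_tags):
    """Span-level F1: a prediction counts only if boundaries and label both match."""
    gold, pred = iob2_spans(gold_tags), iob2_spans(pred_tags)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative tags (hypothetical sentence, not taken from the dataset).
gold = ["B-PERSON", "I-PERSON", "O", "B-GPE", "O"]
pred = ["B-PERSON", "I-PERSON", "O", "O", "O"]
f1 = exact_match_f1(gold, pred)
print(f"exact-match F1 = {f1:.3f}")
```

Here the prediction finds one of the two gold entities, so precision is 1, recall is 0.5, and F1 is 2/3. A corpus-level score would pool spans over all sentences before computing precision and recall.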
AS · Jul 30, 2021 · Code
USC: An Open-Source Uzbek Speech Corpus and Initial Speech Recognition Experiments
Muhammadjon Musaev, Saida Mussakhojayeva, Ilyos Khujayorov et al.
We present a freely available speech corpus for the Uzbek language and report preliminary automatic speech recognition (ASR) results using both the deep neural network hidden Markov model (DNN-HMM) and end-to-end (E2E) architectures. The Uzbek speech corpus (USC) comprises 958 different speakers with a total of 105 hours of transcribed audio recordings. To the best of our knowledge, this is the first open-source Uzbek speech corpus dedicated to the ASR task. To ensure high quality, the USC has been manually checked by native speakers. We first describe the design and development procedures of the USC, and then explain the conducted ASR experiments in detail. The experimental results are promising for the applicability of the USC to ASR. Specifically, 18.1% and 17.4% word error rates were achieved on the validation and test sets, respectively. To enable experiment reproducibility, we share the USC dataset, pre-trained models, and training recipes in our GitHub repository.
AS · Apr 17, 2021 · Code
KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset
Saida Mussakhojayeva, Aigerim Janaliyeva, Almas Mirzakhmetov et al.
This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide. The dataset consists of about 93 hours of transcribed audio recordings spoken by two professional speakers (female and male). It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech (TTS) applications in both academia and industry. In this paper, we share our experience by describing the dataset development procedures and the challenges faced, and discuss important future directions. To demonstrate the reliability of our dataset, we built baseline end-to-end TTS models and evaluated them using the subjective mean opinion score (MOS) measure. Evaluation results show that the best TTS models trained on our dataset achieve MOS above 4 for both speakers, which makes them applicable for practical use. The dataset, training recipe, and pretrained TTS models are freely available.
AS · Sep 22, 2020 · Code
A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline
Yerbolat Khassanov, Saida Mussakhojayeva, Almas Mirzakhmetov et al.
We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as well as both genders. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data collection and preprocessing procedures followed by a description of the database specifications. We also share our experience and challenges faced during the database construction, which might benefit other researchers planning to build a speech corpus for a low-resource language. To demonstrate the reliability of the database, we performed preliminary speech recognition experiments. The experimental results imply that the quality of audio and transcripts is promising (2.8% character error rate and 8.7% word error rate on the test set). To enable experiment reproducibility and ease the corpus usage, we also released an ESPnet recipe for our speech recognition models.
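The character and word error rates quoted in these speech abstracts are both edit distances normalised by the reference length. The sketch below shows one common way to compute them. It is a minimal illustration under assumptions: the function names are hypothetical, text normalisation is omitted, and the English sentence pair is invented rather than taken from any corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion of r
                            curr[j - 1] + 1,          # insertion of h
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate (spaces counted as characters); assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this toy pair the hypothesis has one substituted word and one deleted word, giving a WER of 2/6. At the character level that is one substitution and four deletions over 22 reference characters. A corpus-level rate sums the edit distances and the reference lengths over all utterances before dividing.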
CL · Mar 28, 2024
KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes
Rustem Yeshpanov, Huseyin Atakan Varol
This paper presents KazSAnDRA, a dataset developed for Kazakh sentiment analysis that is the first and largest publicly available dataset of its kind. KazSAnDRA comprises an extensive collection of 180,064 reviews obtained from various sources and includes numerical ratings ranging from 1 to 5, providing a quantitative representation of customer attitudes. The study also pursued the automation of Kazakh sentiment classification through the development and evaluation of four machine learning models trained for both polarity classification and score classification. Experimental analysis included evaluation of the results considering both balanced and imbalanced scenarios. The most successful model attained an F1-score of 0.81 for polarity classification and 0.39 for score classification on the test sets. The dataset and fine-tuned models are open access and available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.
CL · Mar 28, 2024
KazParC: Kazakh Parallel Corpus for Machine Translation
Rustem Yeshpanov, Alina Polonskaya, Huseyin Atakan Varol
We introduce KazParC, a parallel corpus designed for machine translation across Kazakh, English, Russian, and Turkish. The first and largest publicly available corpus of its kind, KazParC contains a collection of 371,902 parallel sentences covering different domains and developed with the assistance of human translators. Our research efforts also extend to the development of a neural machine translation model nicknamed Tilmash. Remarkably, the performance of Tilmash is on par with, and in certain instances surpasses, that of industry giants such as Google Translate and Yandex Translate, as measured by standard evaluation metrics such as BLEU and chrF. Both KazParC and Tilmash are openly available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.
QM · Feb 15, 2025
Breast Lump Detection and Localization with a Tactile Glove Using Deep Learning
Togzhan Syrymova, Amir Yelenov, Karina Burunchina et al.
Breast cancer is the leading cause of mortality among women. Inspection of breasts by palpation is the key to early detection. We aim to create a wearable tactile glove that could localize the lump in breasts using deep learning (DL). In this work, we present our flexible fabric-based and soft wearable tactile glove for detecting the lumps within custom-made silicone breast prototypes (SBPs). SBPs are made of soft silicone that imitates the human skin and the inner part of the breast. Ball-shaped silicone tumors of 1.5-, 1.75- and 2.0-cm diameters are embedded inside to create another set with lumps. Our approach is based on the InceptionTime DL architecture with transfer learning between experienced and non-experienced users. We collected a dataset from 10 naive participants and one oncologist-mammologist palpating SBPs. We demonstrated that the DL model can classify lump presence, size and location with an accuracy of 82.22%, 67.08% and 62.63%, respectively. In addition, we showed that the model adapted to unseen experienced users with an accuracy of 95.01%, 88.54% and 82.98% for lump presence, size and location classification, respectively. This technology can assist inexperienced users or healthcare providers, thus facilitating more frequent routine checks.
CV · May 12, 2023
A Central Asian Food Dataset for Personalized Dietary Interventions, Extended Abstract
Aknur Karabay, Arman Bolatov, Huseyin Atakan Varol et al.
Nowadays, it is common for people to take photographs of every beverage, snack, or meal they eat and then post these photographs on social media platforms. Leveraging these social trends, real-time food recognition and reliable classification of these captured food images can potentially help replace some of the tedious recording and coding of food diaries to enable personalized dietary interventions. Although Central Asian cuisine is culturally and historically distinct, there has been little published data on the food and dietary habits of people in this region. To fill this gap, we aim to create a reliable dataset of regional foods that is easily accessible to both public consumers and researchers. To the best of our knowledge, this is the first work on creating a Central Asian Food Dataset (CAFD). The final dataset contains 42 food categories and over 16,000 images of national dishes unique to this region. We achieved a classification accuracy of 88.70% (42 classes) on the CAFD using the ResNet152 neural network model. The food recognition models trained on the CAFD demonstrate computer vision's effectiveness and high accuracy for dietary assessment.
AS · Jan 15, 2022
KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus With More Data, Speakers, and Topics
Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol
We present an expanded version of our previously released Kazakh text-to-speech (KazakhTTS) synthesis corpus. In the new KazakhTTS2 corpus, the overall size has increased from 93 hours to 271 hours, the number of speakers has risen from two to five (three females and two males), and the topic coverage has been diversified with the help of new sources, including a book and Wikipedia articles. This corpus is necessary for building high-quality TTS systems for Kazakh, a Central Asian agglutinative language from the Turkic family, which presents several linguistic challenges. We describe the corpus construction process and provide the details of the training and evaluation procedures for the TTS system. Our experimental results indicate that the constructed corpus is sufficient to build robust TTS models for real-world applications, with a subjective mean opinion score ranging from 3.6 to 4.2 for all the five speakers. We believe that our corpus will facilitate speech and language research for Kazakh and other Turkic languages, which are widely considered to be low-resource due to the limited availability of free linguistic data. The constructed corpus, code, and pretrained models are publicly available in our GitHub repository.
CV · Oct 23, 2021
A Study of Multimodal Person Verification Using Audio-Visual-Thermal Data
Madina Abdrakhmanova, Saniya Abushakimova, Yerbolat Khassanov et al.
In this paper, we study an approach to multimodal person verification using audio, visual, and thermal modalities. The combination of audio and visual modalities has already been shown to be effective for robust person verification. From this perspective, we investigate the impact of further increasing the number of modalities by adding thermal images. In particular, we implemented unimodal, bimodal, and trimodal verification systems using state-of-the-art deep learning architectures and compared their performance under clean and noisy conditions. We also compared two popular fusion approaches based on simple score averaging and the soft attention mechanism. The experiment conducted on the SpeakingFaces dataset demonstrates the superior performance of the trimodal verification system. Specifically, on the easy test set, the trimodal system outperforms the best unimodal and bimodal systems by over 50% and 18% relative equal error rates, respectively, under both the clean and noisy conditions. On the hard test set, the trimodal system outperforms the best unimodal and bimodal systems by over 40% and 13% relative equal error rates, respectively, under both the clean and noisy conditions. To enable reproducibility of the experiment and facilitate research into multimodal person verification, we made our code, pretrained models, and preprocessed dataset freely available in our GitHub repository.
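The simple score-averaging fusion and the equal error rate (EER) mentioned in this abstract can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the function names are hypothetical, the EER is approximated by a plain threshold sweep, and the toy scores are invented and do not come from the SpeakingFaces experiments.

```python
def average_fusion(*modality_scores):
    """Fuse per-trial verification scores from several modalities by averaging."""
    return [sum(trial) / len(trial) for trial in zip(*modality_scores)]


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    labels: 1 for a target (same-person) trial, 0 for a non-target trial.
    Returns the mean of false-accept and false-reject rates at the threshold
    where the two are closest.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        far = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0) / n_nontarget
        frr = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1) / n_target
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER improvement of a new system over a baseline."""
    return (baseline_eer - new_eer) / baseline_eer


# Invented toy scores: four target trials followed by four non-target trials.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
audio = [0.9, 0.8, 0.7, 0.4, 0.5, 0.3, 0.2, 0.1]
visual = [0.8, 0.4, 0.9, 0.6, 0.2, 0.5, 0.1, 0.3]
thermal = [0.6, 0.7, 0.3, 0.8, 0.4, 0.2, 0.5, 0.1]

fused = average_fusion(audio, visual, thermal)
for name, scores in [("audio", audio), ("visual", visual),
                     ("thermal", thermal), ("fused", fused)]:
    print(f"{name}: EER = {equal_error_rate(scores, labels):.2f}")
```

In this toy data each modality misranks a different trial, so averaging separates targets from non-targets even though every single modality makes errors. That is the intuition behind fusion, not a reproduction of the reported gains. The soft attention alternative named in the abstract would replace the fixed equal weights with learned, input-dependent ones.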
AS · Aug 3, 2021
A Study of Multilingual End-to-End Speech Recognition for Kazakh, Russian, and English
Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol
We study training a single end-to-end (E2E) automatic speech recognition (ASR) model for three languages used in Kazakhstan: Kazakh, Russian, and English. We first describe the development of multilingual E2E ASR based on Transformer networks and then perform an extensive assessment on the aforementioned languages. We also compare two variants of output grapheme set construction: combined and independent. Furthermore, we evaluate the impact of LMs and data augmentation techniques on the recognition performance of the multilingual E2E ASR. In addition, we present several datasets for training and evaluation purposes. Experiment results show that the multilingual models achieve comparable performances to the monolingual baselines with a similar number of parameters. Our best monolingual and multilingual models achieved 20.9% and 20.5% average word error rates on the combined test set, respectively. To ensure the reproducibility of our experiments and results, we share our training recipes, datasets, and pre-trained models.
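One plausible reading of the "combined" and "independent" grapheme set variants is sketched below. The abstract does not spell out the construction, so everything here is an assumption for illustration: the function names, the `character_language` unit format, and the three one-word toy corpora are hypothetical.

```python
def combined_graphemes(corpora):
    """Combined variant (assumed): one shared set, so a character used by
    several languages becomes a single output unit."""
    return sorted({ch for texts in corpora.values()
                   for text in texts for ch in text if not ch.isspace()})


def independent_graphemes(corpora):
    """Independent variant (assumed): each language keeps its own units, so a
    shared character yields one unit per language, tagged with a language code."""
    return sorted({f"{ch}_{lang}" for lang, texts in corpora.items()
                   for text in texts for ch in text if not ch.isspace()})


# Toy corpora: one greeting per language (Kazakh, Russian, English).
corpora = {
    "kk": ["сәлем"],
    "ru": ["привет"],
    "en": ["hello"],
}

combined = combined_graphemes(corpora)
independent = independent_graphemes(corpora)
print(len(combined), len(independent))
```

Kazakh and Russian both use Cyrillic script, so the combined set merges the characters they share and is smaller than the independent set, which keeps a separate copy per language.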
IV · Jul 19, 2021
Input Agnostic Deep Learning for Alzheimer's Disease Classification Using Multimodal MRI Images
Aidana Massalimova, Huseyin Atakan Varol
Alzheimer's disease (AD) is a progressive brain disorder that causes memory and functional impairments. The advances in machine learning and publicly available medical datasets initiated multiple studies in AD diagnosis. In this work, we utilize a multi-modal deep learning approach in classifying normal cognition, mild cognitive impairment and AD classes on the basis of structural MRI and diffusion tensor imaging (DTI) scans from the OASIS-3 dataset. In addition to a conventional multi-modal network, we also present an input agnostic architecture that allows diagnosis with either sMRI or DTI scan, which distinguishes our method from previous multi-modal machine learning-based methods. The results show that the input agnostic model achieves 0.96 accuracy when both structural MRI and DTI scans are provided as inputs.
SY · May 28, 2021
End-to-End Deep Fault Tolerant Control
Daulet Baimukashev, Bexultan Rakhim, Matteo Rubagotti et al.
Published in IEEE/ASME Transactions on Mechatronics, DOI: 10.1109/TMECH.2021.3100150. Ideally, accurate sensor measurements are needed to achieve a good performance in the closed-loop control of mechatronic systems. As a consequence, sensor faults will prevent the system from working correctly, unless a fault-tolerant control (FTC) architecture is adopted. As model-based FTC algorithms for nonlinear systems are often challenging to design, this paper focuses on a new method for FTC in the presence of sensor faults, based on deep learning. The considered approach replaces the phases of fault detection and isolation and controller design with a single recurrent neural network, which takes the values of past sensor measurements in a given time window as input and produces the current values of the control variables as output. This end-to-end deep FTC method is applied to a mechatronic system composed of a spherical inverted pendulum, whose configuration is changed via reaction wheels, in turn actuated by electric motors. The simulation and experimental results show that the proposed method can handle abrupt faults occurring in link position/velocity sensors. The provided supplementary material includes a video of real-world experiments and the software source code.
HC · Dec 5, 2020
SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams
Madina Abdrakhmanova, Askat Kuzdeuov, Sheikh Jarju et al.
We present SpeakingFaces as a publicly-available large-scale multimodal dataset developed to support machine learning research in contexts that utilize a combination of thermal, visual, and audio data streams; examples include human-computer interaction, biometric authentication, recognition systems, domain transfer, and speech recognition. SpeakingFaces is comprised of aligned high-resolution thermal and visual spectra image streams of fully-framed faces synchronized with audio recordings of each subject speaking approximately 100 imperative phrases. Data were collected from 142 subjects, yielding over 13,000 instances of synchronized data (~3.8 TB). For technical validation, we demonstrate two baseline examples. The first baseline shows classification by gender, utilizing different combinations of the three data streams in both clean and noisy environments. The second example consists of thermal-to-visual facial image translation, as an instance of domain transfer.
IV · Mar 19, 2020
End-to-End Deep Diagnosis of X-ray Images
Kudaibergen Urinbayev, Yerassyl Orazbek, Yernur Nurambek et al.
In this work, we present an end-to-end deep learning framework for X-ray image diagnosis. As the first step, our system determines whether a submitted image is an X-ray or not. After it classifies the type of the X-ray, it runs the dedicated abnormality classification network. In this work, we focus only on chest X-rays for abnormality classification. However, the system can easily be extended to other X-ray types. Our deep learning classifiers are based on the DenseNet-121 architecture. The test set accuracies obtained for the 'X-ray or Not', 'X-ray Type Classification', and 'Chest Abnormality Classification' tasks are 0.987, 0.976, and 0.947, respectively, resulting in an end-to-end accuracy of 0.91. To achieve better results than the state of the art in 'Chest Abnormality Classification', we utilize the new RAdam optimizer. We also use Gradient-weighted Class Activation Mapping for visual explanation of the results. Our results show the feasibility of a generalized online projectional radiography diagnosis system.
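The end-to-end figure is consistent with multiplying the three stage accuracies. The check below assumes that an image is handled correctly only if every stage in the cascade is correct, and that stage errors are independent. The abstract does not state how the authors derived the number, so this is a plausibility check rather than their method.

```python
import math

# Per-stage test accuracies quoted in the abstract.
stage_accuracy = {
    "X-ray or Not": 0.987,
    "X-ray Type Classification": 0.976,
    "Chest Abnormality Classification": 0.947,
}

# Assumption: stages run in sequence and their errors are independent,
# so the cascade accuracy is the product of the stage accuracies.
end_to_end = math.prod(stage_accuracy.values())
print(f"end-to-end accuracy = {end_to_end:.3f}")  # 0.912
```

Under these assumptions the product is about 0.912, which rounds to the reported 0.91.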
RO · Aug 10, 2019
Color-Coded Fiber-Optic Tactile Sensor for an Elastomeric Robot Skin
Zhanat Kappassov, Daulet Baimukashev, Zharaskhan Kuanyshuly et al.
The sense of touch is essential for reliable mapping between the environment and a robot which interacts physically with objects. Presumably, an artificial tactile skin would facilitate safe interaction of the robots with the environment. In this work, we present our color-coded tactile sensor, incorporating plastic optical fibers (POF), transparent silicone rubber and an off-the-shelf color camera. Processing electronics are placed away from the sensing surface to make the sensor robust to harsh environments. Contact localization is possible thanks to the lower number of light sources compared to the number of camera POFs. Classical machine learning techniques and a hierarchical classification scheme were used for contact localization. Specifically, we generated the mapping from stimulation to sensation of a robotic perception system using our sensor. We achieved a force sensing range up to 18 N with the force resolution of around 3.6 N and the spatial resolution of 8 mm. The color-coded tactile sensor is suitable for tactile exploration and might enable further innovations in robust tactile sensing.