Wasifur Rahman

LG
8 papers
2,221 citations
Novelty 39%
AI Score 27

8 Papers

CV Mar 15, 2022
Auto-Gait: Automatic Ataxia Risk Assessment with Computer Vision on Gait Task Videos

Wasifur Rahman, Masum Hasan, Md Saiful Islam et al.

In this paper, we investigated whether we can 1) detect participants with ataxia-specific gait characteristics (risk-prediction), and 2) assess severity of ataxia from gait (severity-assessment) using computer vision. We created a dataset of 155 videos from 89 participants (24 controls and 65 diagnosed with, or pre-manifest for, spinocerebellar ataxias (SCAs)) performing the gait task of the Scale for the Assessment and Rating of Ataxia (SARA) at 11 medical sites located in 8 different states across the United States. We developed a computer vision pipeline to detect, track, and separate out the participants from their surroundings and constructed several features from their body pose coordinates to capture gait characteristics like step width, step length, swing, stability, speed, etc. Our risk-prediction model achieves 83.06% accuracy and an 80.23% F1 score. Similarly, our severity-assessment model achieves a mean absolute error (MAE) of 0.6225 and a Pearson's correlation coefficient of 0.7268. Our models still performed competitively when evaluated on data from sites not used during training. Furthermore, through feature importance analysis, we found that our models associate wider steps, decreased walking speed, and increased instability with greater ataxia severity, which is consistent with previously established clinical knowledge. Our models create possibilities for remote ataxia assessment in non-clinical settings in the future, which could significantly improve accessibility of ataxia care. Furthermore, our underlying dataset was assembled from a geographically diverse cohort, highlighting its potential to further increase equity. The code used in this study is open to the public, and the anonymized body pose landmark dataset is also available upon request.
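The two severity-assessment metrics quoted above, MAE and Pearson's correlation coefficient, can be illustrated with a minimal sketch. The function names and the sample scores are made up for illustration and are not taken from the paper's code or data.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between reference scores and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical clinician gait scores and model predictions (illustrative only).
clinician = [0, 1, 2, 3, 1]
predicted = [0.4, 1.2, 1.5, 2.6, 1.1]
print(mean_absolute_error(clinician, predicted))
print(pearson_r(clinician, predicted))
```

A lower MAE means predictions sit closer to the clinician scores on average, while a Pearson coefficient near 1 means the predictions rise and fall with them.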

LG Mar 30, 2023
Using AI to Measure Parkinson's Disease Severity at Home

Md Saiful Islam, Wasifur Rahman, Abdelrahman Abdelkader et al.

We present an artificial intelligence system to remotely assess the motor performance of individuals with Parkinson's disease (PD). Participants performed a motor task (i.e., tapping fingers) in front of a webcam, and data from 250 global participants were rated by three expert neurologists following the Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS). The neurologists' ratings were highly reliable, with an intra-class correlation coefficient (ICC) of 0.88. We developed computer algorithms to obtain objective measurements that align with the MDS-UPDRS guideline and are strongly correlated with the neurologists' ratings. Our machine learning model trained on these measures outperformed an MDS-UPDRS certified rater, with a mean absolute error (MAE) of 0.59 compared to the rater's MAE of 0.79. However, the model performed slightly worse than the expert neurologists (0.53 MAE). The methodology can be replicated for similar motor tasks, providing the possibility of evaluating individuals with PD and other movement disorders remotely, objectively, and in areas with limited access to neurological care.

IV Aug 3, 2023
Unmasking Parkinson's Disease with Smile: An AI-enabled Screening Framework

Tariq Adnan, Md Saiful Islam, Wasifur Rahman et al.

We present an efficient and accessible Parkinson's disease (PD) screening method by leveraging AI-driven models enabled by the largest video dataset of facial expressions from 1,059 unique participants. This dataset includes 256 individuals with PD (165 clinically diagnosed and 91 self-reported). Participants used webcams to record themselves mimicking three facial expressions (smile, disgust, and surprise) in diverse settings, including their homes across multiple countries, a US clinic, and a PD wellness center in the US. Facial landmarks are automatically tracked from the recordings to extract features related to hypomimia, a prominent PD symptom characterized by reduced facial expressions. Machine learning algorithms are trained on these features to distinguish between individuals with and without PD. The model was tested for generalizability on external (unseen during training) test videos collected from a US clinic and Bangladesh. An ensemble of machine learning models trained on smile videos achieved an accuracy of 87.9 ± 0.1% (95% confidence interval) with an AUROC of 89.3 ± 0.3% as evaluated on held-out data (using k-fold cross-validation). In external test settings, the ensemble model achieved 79.8 ± 0.6% accuracy with 81.9 ± 0.3% AUROC on the clinical test set and 84.9 ± 0.4% accuracy with 81.2 ± 0.6% AUROC on participants from Bangladesh. In every setting, the model was free from detectable bias across sex and ethnic subgroups, except in the cohort from Bangladesh, where the model performed significantly better for female participants than for males. Smiling videos can effectively differentiate between individuals with and without PD, offering a potentially easy, accessible, and cost-efficient way to screen for PD, especially when a clinical diagnosis is difficult to access.
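The abstract reports each metric as a mean with a 95% confidence interval but does not say how the interval was computed. As a rough sketch, assuming a normal approximation over repeated evaluation scores (the function name and the scores below are hypothetical), such a figure could be obtained as follows.

```python
import math
import statistics


def mean_with_ci95(scores):
    """Return (mean, half_width) of a 95% interval, normal approximation (z = 1.96).

    Assumption: the scores are independent repeats of the same evaluation,
    e.g. one accuracy value per cross-validation repetition.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate spread")
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, 1.96 * standard_error


# Made-up accuracy values from repeated runs (not the paper's results).
accuracies = [0.878, 0.880, 0.879, 0.881, 0.877]
mean, half_width = mean_with_ci95(accuracies)
print(f"{100 * mean:.1f} ± {100 * half_width:.1f}%")
```

With few repeats, a t-distribution multiplier would be more conservative than 1.96; the constant is kept here only to keep the sketch short.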

CL Mar 27, 2023
TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models

Md Kamrul Hasan, Md Saiful Islam, Sangwu Lee et al.

Pre-trained large language models have recently achieved ground-breaking performance in a wide variety of language understanding tasks. However, the same model cannot be applied to multimodal behavior understanding tasks (e.g., video sentiment/humor detection) unless non-verbal features (e.g., acoustic and visual) can be integrated with language. Jointly modeling multiple modalities significantly increases the model complexity and makes the training process data-hungry. While an enormous amount of text data is available via the web, collecting large-scale multimodal behavioral video datasets is extremely expensive, both in terms of time and money. In this paper, we investigate whether large language models alone can successfully incorporate non-verbal information when it is presented in textual form. We present a way to convert the acoustic and visual information into corresponding textual descriptions and concatenate them with the spoken text. We feed this augmented input to a pre-trained BERT model and fine-tune it on three downstream multimodal tasks: sentiment, humor, and sarcasm detection. Our approach, TextMI, significantly reduces model complexity, adds interpretability to the model's decision, and can be applied to a diverse set of tasks while achieving superior (multimodal sarcasm detection) or near SOTA (multimodal sentiment analysis and multimodal humor detection) performance. We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks, particularly in a low-resource setting.
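The textualization idea described above can be sketched in a few lines. This is not the authors' implementation: the feature names, thresholds, sentence template, and the use of a literal `[SEP]` marker between the spoken text and the description are all assumptions made for illustration.

```python
def describe_level(name, value, low, high):
    """Map a numeric feature to a coarse word: low, moderate, or high."""
    if value < low:
        level = "low"
    elif value > high:
        level = "high"
    else:
        level = "moderate"
    return f"{level} {name}"


def textualize(spoken_text, acoustic, visual, thresholds):
    """Append a textual description of non-verbal cues to the spoken text.

    `acoustic` and `visual` map hypothetical feature names to values;
    `thresholds` maps each feature name to a (low, high) pair.
    """
    cues = []
    for name, value in {**acoustic, **visual}.items():
        low, high = thresholds[name]
        cues.append(describe_level(name, value, low, high))
    description = "The speaker shows " + ", ".join(cues) + "."
    # Assumed separator; a real pipeline would rely on its tokenizer's conventions.
    return f"{spoken_text} [SEP] {description}"


augmented = textualize(
    "That was great",
    acoustic={"pitch": 250.0},
    visual={"smile": 0.1},
    thresholds={"pitch": (120.0, 220.0), "smile": (0.2, 0.6)},
)
print(augmented)  # That was great [SEP] The speaker shows high pitch, low smile.
```

The augmented string would then be tokenized and passed to a text-only model for fine-tuning, which is what lets a language model consume the non-verbal cues without architectural changes.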

DB Mar 26, 2021
DBATES: DataBase of Audio features, Text, and visual Expressions in competitive debate Speeches

Taylan K. Sen, Gazi Naven, Luke Gerstner et al.

In this work, we present a database of multimodal communication features extracted from debate speeches in the 2019 North American Universities Debate Championships (NAUDC). Feature sets were extracted from the visual (facial expression, gaze, and head pose), audio (PRAAT), and textual (word sentiment and linguistic category) modalities of raw video recordings of competitive collegiate debaters (N=717 6-minute recordings from 140 unique debaters). Each speech has an associated competition debate score (range: 67-96) from expert judges as well as competitor demographic and per-round reflection surveys. We observe that the fully multimodal model performs best in comparison to models trained on various compositions of modalities. We also find that the weights of some features (such as the expression of joy and the use of the word "we") change in direction between the aforementioned models. We use these results to highlight the value of a multimodal dataset for studying competitive, collegiate debate.

AS Sep 2, 2020
Detecting Parkinson's Disease From an Online Speech-task

Wasifur Rahman, Sangwu Lee, Md. Saiful Islam et al.

In this paper, we envision a web-based framework that can help anyone, anywhere around the world, record a short speech task, and analyze the recorded data to screen for Parkinson's disease (PD). We collected data from 726 unique participants (262 PD, 38% female; 464 non-PD, 65% female; average age: 61) from all over the US and beyond. A small portion of the data was collected in a lab setting to compare quality. The participants were instructed to utter a popular pangram containing all the letters in the English alphabet: "the quick brown fox jumps over the lazy dog". We extracted both standard acoustic features (Mel Frequency Cepstral Coefficients (MFCC), jitter and shimmer variants) and deep learning based features from the speech data. Using these features, we trained several machine learning algorithms. We achieved 0.75 AUC (Area Under The Curve) performance on determining presence of self-reported Parkinson's disease by modeling the standard acoustic features with XGBoost, a gradient-boosted decision tree model. Further analysis reveals that the widely used MFCC features and a subset of previously validated dysphonia features designed for detecting Parkinson's from a verbal phonation task (pronouncing 'ahh') contain the most distinct information. Our model performed equally well on data collected in a controlled lab environment and 'in the wild', across different gender and age groups. Using this tool, we can collect data from almost anyone anywhere with a video/audio enabled device, contributing to equity and access in neurological care.
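Jitter and shimmer, two of the standard acoustic features named above, measure cycle-to-cycle variation in pitch period and amplitude. The sketch below uses the common "local" definition (mean absolute difference between consecutive cycles, divided by the mean value); the abstract does not say which variants were used, and the sample numbers are invented.

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


def local_jitter(periods):
    """Local jitter from consecutive glottal cycle durations (seconds)."""
    return local_perturbation(periods)


def local_shimmer(amplitudes):
    """Local shimmer from consecutive cycle peak amplitudes."""
    return local_perturbation(amplitudes)


# Invented cycle measurements, for illustration only.
periods = [0.0100, 0.0102, 0.0099, 0.0101]
amplitudes = [0.80, 0.78, 0.82, 0.79]
print(local_jitter(periods), local_shimmer(amplitudes))
```

Extracting the cycle durations and amplitudes from raw audio is the harder step and is usually delegated to a speech analysis toolkit; that step is outside this sketch.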

LG Aug 15, 2019
Integrating Multimodal Information in Large Pretrained Transformers

Wasifur Rahman, Md. Kamrul Hasan, Sangwu Lee et al.

Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication). Pre-trained models don't have the necessary components to accept two extra modalities of vision and acoustic. In this paper, we propose an attachment to BERT and XLNet called Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to the internal representation of BERT and XLNet, a shift that is conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts the sentiment analysis performance over previous baselines as well as language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.
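The abstract describes the gate only as a shift to the internal representation conditioned on the visual and acoustic inputs; it gives no formula. The following is a simplified sketch of one way such a gated shift could look for a single word vector. The scalar ReLU gates, the assumption that both modalities are already projected to the word vector's dimension, the norm cap `beta`, and every name are illustrative assumptions, not the paper's definition.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def gate(z, modality, weights, bias):
    """Scalar ReLU gate computed from the concatenation [z; modality]."""
    return max(0.0, dot(weights, z + modality) + bias)


def gated_shift(z, visual, acoustic, w_v, w_a, b_v=0.0, b_a=0.0, beta=1.0):
    """Return z plus a non-verbal shift whose size is capped relative to |z|.

    Assumes `visual` and `acoustic` have the same length as `z`, and that
    `w_v` and `w_a` have twice that length (one weight per concatenated entry).
    """
    g_v = gate(z, visual, w_v, b_v)
    g_a = gate(z, acoustic, w_a, b_a)
    shift = [g_v * v + g_a * a for v, a in zip(visual, acoustic)]
    shift_norm = norm(shift)
    # Scale the shift so it cannot exceed beta times the norm of z.
    alpha = 1.0 if shift_norm == 0.0 else min(norm(z) / shift_norm * beta, 1.0)
    return [zi + alpha * si for zi, si in zip(z, shift)]


z = [1.0, 0.0]
shifted = gated_shift(z, visual=[0.0, 1.0], acoustic=[0.0, 0.0],
                      w_v=[0.0, 0.0, 0.0, 1.0], w_a=[0.0, 0.0, 0.0, 0.0],
                      beta=0.5)
print(shifted)  # [1.0, 0.5]
```

Capping the shift keeps the adapted vector close to what the pre-trained model expects, which is one plausible reason a mechanism like this can be attached during fine-tuning without retraining from scratch.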

LG Apr 14, 2019
UR-FUNNY: A Multimodal Language Dataset for Understanding Humor

Md Kamrul Hasan, Wasifur Rahman, Amir Zadeh et al.

Humor is a unique and creative communicative behavior displayed during social interactions. It is produced in a multimodal manner, through the usage of words (text), gestures (vision) and prosodic cues (acoustic). Understanding humor from these three modalities falls within the boundaries of multimodal language, a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, it remains understudied in a multimodal context. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding multimodal language used in expressing humor. The dataset and accompanying studies present a framework for multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.