CV Mar 15, 2022
Auto-Gait: Automatic Ataxia Risk Assessment with Computer Vision on Gait Task Videos
Wasifur Rahman, Masum Hasan, Md Saiful Islam et al.
In this paper, we investigated whether we can 1) detect participants with ataxia-specific gait characteristics (risk-prediction), and 2) assess severity of ataxia from gait (severity-assessment) using computer vision. We created a dataset of 155 videos from 89 participants, 24 controls and 65 diagnosed with (or are pre-manifest) spinocerebellar ataxias (SCAs), performing the gait task of the Scale for the Assessment and Rating of Ataxia (SARA) from 11 medical sites located in 8 different states across the United States. We develop a computer vision pipeline to detect, track, and separate out the participants from their surroundings and construct several features from their body pose coordinates to capture gait characteristics like step width, step length, swing, stability, speed, etc. Our risk-prediction model achieves 83.06% accuracy and an 80.23% F1 score. Similarly, our severity-assessment model achieves a mean absolute error (MAE) score of 0.6225 and a Pearson's correlation coefficient score of 0.7268. Our models still performed competitively when evaluated on data from sites not used during training. Furthermore, through feature importance analysis, we found that our models associate wider steps, decreased walking speed, and increased instability with greater ataxia severity, which is consistent with previously established clinical knowledge. Our models create possibilities for remote ataxia assessment in non-clinical settings in the future, which could significantly improve accessibility of ataxia care. Furthermore, our underlying dataset was assembled from a geographically diverse cohort, highlighting its potential to further increase equity. The code used in this study is open to the public, and the anonymized body pose landmark dataset is also available upon request.
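The two severity-assessment metrics reported above (mean absolute error and Pearson's correlation coefficient) can be computed as in the following minimal sketch. The function names and the example scores are illustrative assumptions, not the authors' code or data.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up severity scores, for illustration only.
true_scores = [0, 1, 2, 3, 4]
pred_scores = [0.5, 1.0, 1.5, 3.5, 4.0]
print(mean_absolute_error(true_scores, pred_scores))
print(pearson_r(true_scores, pred_scores))
```

A lower MAE means predictions sit closer to the clinical score on average, while a higher Pearson coefficient means predictions track the ordering and spread of the clinical scores.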
CL Aug 18, 2022
A Survey on Open Information Extraction from Rule-based Model to Large Language Model
Pai Liu, Wenyang Gao, Wenjie Dong et al.
Open Information Extraction (OpenIE) represents a crucial NLP task aimed at deriving structured information from unstructured text, unrestricted by relation type or domain. This survey paper provides an overview of OpenIE technologies spanning from 2007 to 2024, emphasizing a chronological perspective absent in prior surveys. It examines the evolution of task settings in OpenIE to align with the advances in recent technologies. The paper categorizes OpenIE approaches into rule-based, neural, and pre-trained large language models, discussing each within a chronological framework. Additionally, it highlights prevalent datasets and evaluation metrics currently in use. Building on this extensive review, the paper outlines potential future directions in terms of datasets, information sources, output formats, methodologies, and evaluation metrics.
LG Mar 30, 2023
Using AI to Measure Parkinson's Disease Severity at Home
Md Saiful Islam, Wasifur Rahman, Abdelrahman Abdelkader et al.
We present an artificial intelligence system to remotely assess the motor performance of individuals with Parkinson's disease (PD). Participants performed a motor task (i.e., tapping fingers) in front of a webcam, and data from 250 global participants were rated by three expert neurologists following the Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS). The neurologists' ratings were highly reliable, with an intra-class correlation coefficient (ICC) of 0.88. We developed computer algorithms to obtain objective measurements that align with the MDS-UPDRS guideline and are strongly correlated with the neurologists' ratings. Our machine learning model trained on these measures outperformed an MDS-UPDRS certified rater, with a mean absolute error (MAE) of 0.59 compared to the rater's MAE of 0.79. However, the model performed slightly worse than the expert neurologists (0.53 MAE). The methodology can be replicated for similar motor tasks, providing the possibility of evaluating individuals with PD and other movement disorders remotely, objectively, and in areas with limited access to neurological care.
HC Aug 6, 2023
SAPIEN: Affective Virtual Agents Powered by Large Language Models
Masum Hasan, Cengiz Ozel, Sammy Potter et al.
In this demo paper, we introduce SAPIEN, a platform for high-fidelity virtual agents driven by large language models that can hold open domain conversations with users in 13 different languages, and display emotions through facial expressions and voice. The platform allows users to customize their virtual agent's personality, background, and conversation premise, thus providing a rich, immersive interaction experience. Furthermore, after the virtual meeting, the user can choose to get the conversation analyzed and receive actionable feedback on their communication skills. This paper illustrates an overview of the platform and discusses the various application domains of this technology, ranging from entertainment to mental health, communication training, language learning, education, healthcare, and beyond. Additionally, we consider the ethical implications of such realistic virtual agent representations and the potential challenges in ensuring responsible use.
HC Apr 20
Design and Evaluation of a Culturally Adapted Multimodal Virtual Agent for PTSD Screening
Cengiz Ozel, Waleed Nadeem, Samuel Potter et al.
Post-traumatic stress disorder (PTSD) is highly prevalent yet chronically underreported among combat-exposed military personnel. This paper presents Molhim, a culturally adapted multimodal conversational AI platform that supports purpose-specific interactions through a configurable conversational pipeline consisting of session setup, real-time dialogue with a high-fidelity virtual avatar, and post-session analysis and feedback. In this work, we examine the PTSD screening configuration of the Molhim platform in a military healthcare context. The system employs a conversational avatar driven by a large language model, integrating real-time speech recognition, visual understanding of user input, text-to-speech synthesis, and a high-fidelity human avatar to support structured multi-turn dialogue and automated post-session analysis, including administration of the PTSD Checklist for DSM-5 (PCL-5). These findings suggest the feasibility of Molhim as a conversational platform for PTSD screening and highlight design considerations for socially cooperative human-AI systems in clinical environments.
IV Aug 3, 2023
Unmasking Parkinson's Disease with Smile: An AI-enabled Screening Framework
Tariq Adnan, Md Saiful Islam, Wasifur Rahman et al.
We present an efficient and accessible PD screening method by leveraging AI-driven models enabled by the largest video dataset of facial expressions from 1,059 unique participants. This dataset includes 256 individuals with PD, 165 clinically diagnosed, and 91 self-reported. Participants used webcams to record themselves mimicking three facial expressions (smile, disgust, and surprise) from diverse sources encompassing their homes across multiple countries, a US clinic, and a PD wellness center in the US. Facial landmarks are automatically tracked from the recordings to extract features related to hypomimia, a prominent PD symptom characterized by reduced facial expressions. Machine learning algorithms are trained on these features to distinguish between individuals with and without PD. The model was tested for generalizability on external (unseen during training) test videos collected from a US clinic and Bangladesh. An ensemble of machine learning models trained on smile videos achieved an accuracy of 87.9 ± 0.1% (95% confidence interval) with an AUROC of 89.3 ± 0.3% as evaluated on held-out data (using k-fold cross-validation). In external test settings, the ensemble model achieved 79.8 ± 0.6% accuracy with 81.9 ± 0.3% AUROC on the clinical test set and 84.9 ± 0.4% accuracy with 81.2 ± 0.6% AUROC on participants from Bangladesh. In every setting, the model was free from detectable bias across sex and ethnic subgroups, except in the cohorts from Bangladesh, where the model performed significantly better for female participants than males. Smiling videos can effectively differentiate between individuals with and without PD, offering a potentially easy, accessible, and cost-efficient way to screen for PD, especially when a clinical diagnosis is difficult to access.
CL Mar 27, 2023
TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models
Md Kamrul Hasan, Md Saiful Islam, Sangwu Lee et al.
Pre-trained large language models have recently achieved ground-breaking performance in a wide variety of language understanding tasks. However, the same model cannot be applied to multimodal behavior understanding tasks (e.g., video sentiment/humor detection) unless non-verbal features (e.g., acoustic and visual) can be integrated with language. Jointly modeling multiple modalities significantly increases the model complexity and makes the training process data-hungry. While an enormous amount of text data is available via the web, collecting large-scale multimodal behavioral video datasets is extremely expensive, both in terms of time and money. In this paper, we investigate whether large language models alone can successfully incorporate non-verbal information when it is presented in textual form. We present a way to convert the acoustic and visual information into corresponding textual descriptions and concatenate them with the spoken text. We feed this augmented input to a pre-trained BERT model and fine-tune it on three downstream multimodal tasks: sentiment, humor, and sarcasm detection. Our approach, TextMI, significantly reduces model complexity, adds interpretability to the model's decision, and can be applied for a diverse set of tasks while achieving superior (multimodal sarcasm detection) or near SOTA (multimodal sentiment analysis and multimodal humor detection) performance. We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks, particularly in a low-resource setting.
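The idea of turning non-verbal features into text and concatenating them with the spoken words can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the two features (pitch and smile intensity), the thresholds, and the sentence templates are hypothetical and are not taken from the TextMI paper.

```python
def bucket(value, low, high):
    """Map a numeric value to a coarse word using hypothetical thresholds."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "medium"


def textualize(spoken_text, pitch_hz, smile_intensity):
    """Append textual descriptions of non-verbal cues to the spoken text.

    The thresholds below are illustrative assumptions, not published values.
    """
    pitch_word = bucket(pitch_hz, low=120.0, high=220.0)
    smile_word = bucket(smile_intensity, low=0.3, high=0.7)
    acoustic_desc = f"The speaker talks with {pitch_word} pitch."
    visual_desc = f"The speaker shows {smile_word} smile intensity."
    # The augmented string is what would be tokenized and fed to a text-only model.
    return " ".join([spoken_text.strip(), acoustic_desc, visual_desc])


print(textualize("That was great.", pitch_hz=260.0, smile_intensity=0.8))
```

Because the output is ordinary text, it can be passed to any text classifier without architectural changes, which is the source of the reduced model complexity the abstract describes.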
LG Dec 21, 2022
NADBenchmarks -- a compilation of Benchmark Datasets for Machine Learning Tasks related to Natural Disasters
Adiba Mahbub Proma, Md Saiful Islam, Stela Ciko et al.
Climate change has increased the intensity, frequency, and duration of extreme weather events and natural disasters across the world. While the increased data on natural disasters improves the scope of machine learning (ML) in this field, progress is relatively slow. One bottleneck is the lack of benchmark datasets that would allow ML researchers to quantify their progress against a standard metric. The objective of this short paper is to explore the state of benchmark datasets for ML tasks related to natural disasters, categorizing them according to the disaster management cycle. We compile a list of existing benchmark datasets introduced in the past five years. We propose a web platform - NADBenchmarks - where researchers can search for benchmark datasets for natural disasters, and we develop a preliminary version of such a platform using our compiled list. This paper is intended to aid researchers in finding benchmark datasets to train their ML models on, and provide general directions for topics where they can contribute new benchmark datasets.
HC Jan 22
Replicating Human Motivated Reasoning Studies with LLMs
Neeley Pate, Adiba Mahbub Proma, Hangfeng He et al.
Motivated reasoning -- the idea that individuals processing information may be motivated to reach a certain conclusion, whether it be accurate or predetermined -- has been well-explored as a human phenomenon. However, it is unclear whether base LLMs mimic these motivational changes. Replicating 4 prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as smaller standard deviations and inaccurate argument strength assessments. We emphasize the importance of these findings for researchers using LLMs to automate tasks such as survey data collection and argument assessment.
AS Mar 11
Harf-Speech: A Clinically Aligned Framework for Arabic Phoneme-Level Speech Assessment
Asif Azad, MD Sadik Hossain Shanto, Mohammad Sadat Hossain et al.
Automated phoneme-level pronunciation assessment is vital for scalable speech therapy and language learning, yet validated tools for Arabic remain scarce. We present Harf-Speech, a modular system scoring Arabic pronunciation at the phoneme level on a clinical scale. It combines an MSA phonetizer, a fine-tuned speech-to-phoneme model, Levenshtein alignment, and a blended scorer using longest common subsequence and edit-distance metrics. We fine-tune three ASR architectures on Arabic phoneme data and benchmark them with zero-shot multimodal models; the best, OmniASR-CTC-1B-v2, achieves 8.92% phoneme error rate. Three certified speech-language pathologists independently scored 40 utterances for clinical validation. Harf-Speech attains a Pearson correlation of 0.791 and ICC(2,1) of 0.659 with mean expert scores, outperforming existing end-to-end assessment frameworks. These results show Harf-Speech yields clinically aligned, interpretable scores comparable to inter-rater expert agreement.
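Phoneme error rate is conventionally the Levenshtein (edit) distance between the reference and predicted phoneme sequences divided by the reference length. The sketch below shows that standard computation; it is not the Harf-Speech scorer, and the phoneme symbols are placeholders rather than the system's actual inventory.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference phonemes."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Placeholder phoneme sequences, for illustration only.
reference = ["k", "a", "t", "a", "b"]
predicted = ["k", "a", "d", "a", "b"]
print(phoneme_error_rate(reference, predicted))
```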
AI Feb 11
OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization
Keane Ong, Sabri Boughorbel, Luwei Xiao et al.
To develop socially intelligent AI, existing approaches typically model human behavioral dimensions (e.g., affective, cognitive, or social attributes) in isolation. Although useful, task-specific modeling often increases training costs and limits generalization across behavioral settings. Recent reasoning RL methods facilitate training a single unified model across multiple behavioral tasks, but do not explicitly address learning across different heterogeneous behavioral data. To address this gap, we introduce Heterogeneity-Aware Relative Policy Optimization (HARPO), an RL method that balances learning across heterogeneous tasks and samples. This is achieved by modulating advantages to ensure that no single task or sample carries disproportionate influence during policy optimization. Using HARPO, we develop and release Omnisapiens-7B 2.0, a foundation model for social behavior processing. Relative to existing behavioral foundation models, Omnisapiens-7B 2.0 achieves the strongest performance across behavioral tasks, with gains of up to +16.85% and +9.37% on multitask and held-out settings respectively, while producing more explicit and robust reasoning traces. We also validate HARPO against recent RL methods, where it achieves the most consistently strong performance across behavioral tasks.
AI Jul 15, 2022
A Flexible Schema-Guided Dialogue Management Framework: From Friendly Peer to Virtual Standardized Cancer Patient
Benjamin Kane, Catherine Giugno, Lenhart Schubert et al.
A schema-guided approach to dialogue management has been shown in recent work to be effective in creating robust customizable virtual agents capable of acting as friendly peers or task assistants. However, successful applications of these methods in open-ended, mixed-initiative domains remain elusive -- particularly within medical domains such as virtual standardized patients, where such complex interactions are commonplace -- and require more extensive and flexible dialogue management capabilities than previous systems provide. In this paper, we describe a general-purpose schema-guided dialogue management framework used to develop SOPHIE, a virtual standardized cancer patient that allows a doctor to conveniently practice for interactions with patients. We conduct a crowdsourced evaluation of conversations between medical students and SOPHIE. Our agent is judged to produce responses that are natural, emotionally appropriate, and consistent with her role as a cancer patient. Furthermore, it significantly outperforms an end-to-end neural model fine-tuned on a human standardized patient corpus, attesting to the advantages of a schema-guided approach.
HC Nov 21, 2023
PARK: Parkinson's Analysis with Remote Kinetic-tasks
Md Saiful Islam, Sangwu Lee, Abdelrahman Abdelkader et al.
We present a web-based framework to screen for Parkinson's disease (PD) by allowing users to perform neurological tests in their homes. Our web framework guides the users to complete three tasks involving speech, facial expression, and finger movements. The task videos are analyzed to classify whether the users show signs of PD. We present the results in an easy-to-understand manner, along with personalized resources to further access to treatment and care. Our framework is accessible by any major web browser, improving global access to neurological care.
CV Dec 10, 2025
VisualActBench: Can VLMs See and Act like a Human?
Daoan Zhang, Pai Liu, Xiaofei Zhou et al.
Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.
CV Feb 13
Benchmarking Video Foundation Models for Remote Parkinson's Disease Screening
Md Saiful Islam, Ekram Hossain, Abdelrahman Abdelkader et al.
Video-based assessments offer a scalable pathway for remote Parkinson's disease (PD) screening. While traditional approaches rely on handcrafted features mimicking clinical scales, recent advances in video foundation models (VFMs) enable representation learning without task-specific customization. However, the comparative effectiveness of different VFM architectures across diverse clinical tasks remains poorly understood. We present a large-scale systematic study using a novel video dataset from 1,888 participants (727 with PD), comprising 32,847 videos across 16 standardized clinical tasks. We evaluate seven state-of-the-art VFMs -- including VideoPrism, V-JEPA, ViViT, and VideoMAE -- to determine their robustness in clinical screening. By evaluating frozen embeddings with a linear classification head, we demonstrate that task saliency is highly model-dependent: VideoPrism excels in capturing visual speech kinematics (no audio) and facial expressivity, while V-JEPA proves superior for upper-limb motor tasks. Notably, TimeSformer remains highly competitive for rhythmic tasks like finger tapping. Our experiments yield AUCs of 76.4 - 85.3% and accuracies of 71.5 - 80.6%. While high specificity (up to 90.3%) suggests strong potential for ruling out healthy individuals, the lower sensitivity (43.2 - 57.3%) highlights the need for task-aware calibration and integration of multiple tasks and modalities. Overall, this work establishes a rigorous baseline for VFM-based PD screening and provides a roadmap for selecting suitable tasks and architectures in remote neurological monitoring. Code and anonymized structured data are publicly available: https://anonymous.4open.science/r/parkinson_video_benchmarking-A2C5
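The gap between specificity and sensitivity noted above follows directly from how the two are defined on a confusion matrix. The sketch below uses the standard definitions with made-up labels; it is not the benchmark's evaluation code.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = PD, 0 = non-PD."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of PD cases correctly flagged
    specificity = tn / (tn + fp)  # fraction of non-PD cases correctly cleared
    return sensitivity, specificity


# Made-up labels, for illustration only: a conservative classifier that rarely predicts PD.
labels      = [1, 1, 1, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 0, 1]
print(sensitivity_specificity(labels, predictions))
```

A classifier that rarely predicts the positive class tends to score high on specificity and low on sensitivity, which is why the decision threshold often needs calibrating per task.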
SI May 5
Can LLMs Emulate Human Belief Dynamics?
Adiba Mahbub Proma, Neeley Pate, James N. Druckman et al.
Can LLMs simulate how humans form and change beliefs in social networks? We put this to the test by replicating an established study on belief dynamics, evaluating 12 LLMs across multiple model families and parameter sizes. The answer is a clear no, and in systematic ways. LLMs fail to capture initial human belief distributions and tend to be overall more conformist than humans, shifting their responses to align with those around them. They also take a nuanced approach to emulating human homophilic tendencies within networks. Our findings carry a double payoff: they highlight fundamental properties of LLM behavior, and they raise a sharp warning against deploying LLMs as human proxies in social simulations.
SD May 21, 2024
A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings
Tariq Adnan, Abdelrahman Abdelkader, Zipei Liu et al.
We present a framework to recognize Parkinson's disease (PD) through an English pangram utterance speech collected using a web application from diverse recording settings and environments, including participants' homes. Our dataset includes a global cohort of 1306 participants, including 392 diagnosed with PD. Leveraging the diversity of the dataset, spanning various demographic properties (such as age, sex, and ethnicity), we used deep learning embeddings derived from semi-supervised models such as Wav2Vec 2.0, WavLM, and ImageBind representing the speech dynamics associated with PD. Our novel fusion model for PD classification, which aligns different speech embeddings into a cohesive feature space, demonstrated superior performance over standard concatenation-based fusion models and other baselines (including models built on traditional acoustic features). In a randomized data split configuration, the model achieved an Area Under the Receiver Operating Characteristic Curve (AUROC) of 88.94% and an accuracy of 85.65%. Rigorous statistical analysis confirmed that our model performs equitably across various demographic subgroups in terms of sex, ethnicity, and age, and remains robust regardless of disease duration. Furthermore, our model, when tested on two entirely unseen test datasets collected from clinical settings and from a PD care center, maintained AUROC scores of 82.12% and 78.44%, respectively. This affirms the model's robustness and its potential to enhance accessibility and health equity in real-world applications.
AI Oct 20, 2024
AI Can Enhance Creativity in Social Networks
Raiyan Abdul Baten, Ali Sarosh Bangash, Krish Veera et al.
Can peer recommendation engines elevate people's creative performances in self-organizing social networks? Answering this question requires resolving challenges in data collection (e.g., tracing inspiration links and psycho-social attributes of nodes) and intervention design (e.g., balancing idea stimulation and redundancy in evolving information environments). We trained a model that predicts people's ideation performances using semantic and network-structural features in an online platform. Using this model, we built SocialMuse, which maximizes people's predicted performances to generate peer recommendations for them. We found treatment networks leveraging SocialMuse outperforming AI-agnostic control networks in several creativity measures. The treatment networks were more decentralized than the control, as SocialMuse increasingly emphasized network-structural features at large network sizes. This decentralization spreads people's inspiration sources, helping inspired ideas stand out better. Our study provides actionable insights into building intelligent systems for elevating creativity.
CL Feb 28, 2025
How LLMs Fail to Support Fact-Checking
Adiba Mahbub Proma, Neeley Pate, James Druckman et al.
While Large Language Models (LLMs) can amplify online misinformation, they also show promise in tackling misinformation. In this paper, we empirically study the capabilities of three LLMs -- ChatGPT, Gemini, and Claude -- in countering political misinformation. We implement a two-step, chain-of-thought prompting approach, where models first identify credible sources for a given claim and then generate persuasive responses. Our findings suggest that models struggle to ground their responses in real news sources, and tend to prefer citing left-leaning sources. We also observe varying degrees of response diversity among models. Our findings highlight concerns about using LLMs for fact-checking through only prompt-engineering, emphasizing the need for more robust guardrails. Our results have implications for both researchers and non-technical users.
HC May 5, 2025
AI Standardized Patient Improves Human Conversations in Advanced Cancer Care
Kurtis Haut, Masum Hasan, Thomas Carroll et al.
Serious illness communication (SIC) in end-of-life care faces challenges such as emotional stress, cultural barriers, and balancing hope with honesty. Despite its importance, one of the few available ways for clinicians to practice SIC is with standardized patients, which is expensive, time-consuming, and inflexible. In this paper, we present SOPHIE, an AI-powered standardized patient simulation and automated feedback system. SOPHIE combines large language models (LLMs), a lifelike virtual avatar, and automated, personalized feedback based on clinical literature to provide remote, on-demand SIC training. In a randomized control study with healthcare students and professionals, SOPHIE users demonstrated significant improvement across three critical SIC domains: Empathize, Be Explicit, and Empower. These results suggest that AI-driven tools can enhance complex interpersonal communication skills, offering scalable, accessible solutions to address a critical gap in clinician education.
CV Jun 21, 2024
Accessible, At-Home Detection of Parkinson's Disease via Multi-task Video Analysis
Md Saiful Islam, Tariq Adnan, Jan Freyberg et al.
Limited accessibility to neurological care leads to underdiagnosed Parkinson's Disease (PD), preventing early intervention. Existing AI-based PD detection methods primarily focus on unimodal analysis of motor or speech tasks, overlooking the multifaceted nature of the disease. To address this, we introduce a large-scale, multi-task video dataset consisting of 1102 sessions (each containing videos of finger tapping, facial expression, and speech tasks captured via webcam) from 845 participants (272 with PD). We propose a novel Uncertainty-calibrated Fusion Network (UFNet) that leverages this multimodal data to enhance diagnostic accuracy. UFNet employs independent task-specific networks, trained with Monte Carlo Dropout for uncertainty quantification, followed by self-attended fusion of features, with attention weights dynamically adjusted based on task-specific uncertainties. To ensure patient-centered evaluation, the participants were randomly split into three sets: 60% for training, 20% for model selection, and 20% for final performance evaluation. UFNet significantly outperformed single-task models in terms of accuracy, area under the ROC curve (AUROC), and sensitivity while maintaining non-inferior specificity. Withholding uncertain predictions further boosted the performance, achieving 88.0 ± 0.3% accuracy, 93.0 ± 0.2% AUROC, 79.3 ± 0.9% sensitivity, and 92.6 ± 0.3% specificity, at the expense of not being able to predict for 2.3 ± 0.3% of the data (± denotes 95% confidence interval). Further analysis suggests that the trained model does not exhibit any detectable bias across sex and ethnic subgroups and is most effective for individuals aged between 50 and 80. Requiring only a webcam and microphone, our approach facilitates accessible home-based PD screening, especially in regions with limited healthcare resources.
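Withholding uncertain predictions can be sketched as follows: several stochastic forward passes (as Monte Carlo Dropout would produce) yield a set of probabilities per participant, and the model abstains when their spread is too large. This is a generic illustration, not UFNet itself; the `max_std` cutoff, the function names, and the sample values are assumptions.

```python
from statistics import mean, pstdev


def predict_or_abstain(prob_samples, max_std=0.1, threshold=0.5):
    """Return 1 or 0, or None when the stochastic passes disagree too much.

    prob_samples: PD probabilities from repeated stochastic forward passes.
    max_std: hypothetical uncertainty cutoff (not a value from the paper).
    """
    if pstdev(prob_samples) > max_std:
        return None  # too uncertain: withhold the prediction
    return 1 if mean(prob_samples) >= threshold else 0


def selective_accuracy(all_samples, labels, max_std=0.1):
    """Accuracy on retained predictions plus coverage (fraction not withheld)."""
    decisions = [predict_or_abstain(s, max_std) for s in all_samples]
    kept = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(kept) / len(labels)
    accuracy = sum(d == y for d, y in kept) / len(kept) if kept else float("nan")
    return accuracy, coverage


# Made-up probabilities for three participants, for illustration only.
samples = [[0.90, 0.92, 0.88], [0.10, 0.90, 0.50], [0.20, 0.15, 0.25]]
labels = [1, 1, 0]
print(selective_accuracy(samples, labels))
```

The trade-off matches the one described in the abstract: accuracy on the retained cases can rise, at the cost of producing no prediction for the withheld fraction.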
CV Jun 5, 2024
Hi5: Synthetic Data for Inclusive, Robust, Hand Pose Estimation
Masum Hasan, Cengiz Ozel, Nina Long et al.
Hand pose estimation plays a vital role in capturing subtle nonverbal cues essential for understanding human affect. However, collecting diverse, expressive real-world data remains challenging due to labor-intensive manual annotation that often underrepresents demographic diversity and natural expressions. To address this issue, we introduce a cost-effective approach to generating synthetic data using high-fidelity 3D hand models and a wide range of affective hand poses. Our method includes varied skin tones, genders, dynamic environments, realistic lighting conditions, and diverse naturally occurring gesture animations. The resulting dataset, Hi5, contains 583,000 pose-annotated images, carefully balanced to reflect natural diversity and emotional expressiveness. Models trained exclusively on Hi5 achieve performance comparable to human-annotated datasets, exhibiting superior robustness to occlusions and consistent accuracy across diverse skin tones -- which is crucial for reliably recognizing expressive gestures in affective computing applications. Our results demonstrate that synthetic data effectively addresses critical limitations of existing datasets, enabling more inclusive, expressive, and reliable gesture recognition systems while achieving competitive performance in pose estimation benchmarks. The Hi5 dataset, data synthesis pipeline, source code, and game engine project are publicly released to support further research in synthetic hand-gesture applications.
CV Dec 10, 2023
PULSAR: Graph based Positive Unlabeled Learning with Multi Stream Adaptive Convolutions for Parkinson's Disease Recognition
Md. Zarif Ul Alam, Md Saiful Islam, Ehsan Hoque et al.
Parkinson's disease (PD) is a neurodegenerative disorder that affects movement, speech, and coordination. Timely diagnosis and treatment can improve the quality of life for PD patients. However, access to clinical diagnosis is limited in low- and middle-income countries (LMICs). Therefore, development of automated screening tools for PD can have a huge social impact, particularly in the public health sector. In this paper, we present PULSAR, a novel method to screen for PD from webcam-recorded videos of the finger-tapping task from the Movement Disorder Society - Unified Parkinson's Disease Rating Scale (MDS-UPDRS). PULSAR is trained and evaluated on data collected from 382 participants (183 self-reported as PD patients). We used an adaptive graph convolutional neural network to dynamically learn the spatio-temporal graph edges specific to the finger-tapping task. We enhanced this idea with a multi-stream adaptive convolution model to learn features from different modalities of data critical to detecting PD, such as the relative location of the finger joints and the velocity and acceleration of tapping. As the labels of the videos are self-reported, there could be cases of undiagnosed PD in the non-PD labeled samples. We leveraged the idea of Positive Unlabeled (PU) Learning, which does not need labeled negative data. Our experiments show a clear benefit of modeling the problem in this way. PULSAR achieved 80.95% accuracy on the validation set and a mean accuracy of 71.29% (2.49% standard deviation) on an independent test set, despite being trained with a limited amount of data. This is especially promising as labeled data is scarce in the health care sector. We hope PULSAR will make PD screening more accessible to everyone. The proposed techniques could be extended for assessment of other movement disorders, such as ataxia and Huntington's disease.
DB Mar 26, 2021
DBATES: DataBase of Audio features, Text, and visual Expressions in competitive debate Speeches
Taylan K. Sen, Gazi Naven, Luke Gerstner et al.
In this work, we present a database of multimodal communication features extracted from debate speeches in the 2019 North American Universities Debate Championships (NAUDC). Feature sets were extracted from the visual (facial expression, gaze, and head pose), audio (PRAAT), and textual (word sentiment and linguistic category) modalities of raw video recordings of competitive collegiate debaters (N=717 6-minute recordings from 140 unique debaters). Each speech has an associated competition debate score (range: 67-96) from expert judges as well as competitor demographic and per-round reflection surveys. We observe that the fully multimodal model performs best in comparison to models trained on various compositions of modalities. We also find that the weights of some features (such as the expression of joy and the use of the word "we") change in direction between the aforementioned models. We use these results to highlight the value of a multimodal dataset for studying competitive, collegiate debate.
CYFeb 16, 2021
A Mental Trespass? Unveiling Truth, Exposing Thoughts and Threatening Civil Liberties with Non-Invasive AI Lie DetectionTaylan Sen, Kurtis Haut, Denis Lomakin et al.
Imagine an app on your phone or computer that can tell if you are being dishonest, just by processing affective features of your facial expressions, body movements, and voice. People could ask about your political preferences or your sexual orientation, and immediately determine which of your responses are honest and which are not. In this paper we argue that artificial intelligence-based, non-invasive lie detection technologies are likely to advance rapidly in the coming years, and that it would be irresponsible to wait any longer before discussing their implications. Legal and popular perspectives are reviewed to evaluate the potential for these technologies to cause societal harm. To understand the perspective of a reasonable person, we conducted a survey of 129 individuals, and identified consent and accuracy as the major factors in their decision-making process regarding the use of these technologies. In our analysis, we distinguish two types of lie detection technology: accurate truth metering and accurate thought exposing. We generally find that truth metering is already largely within the scope of existing US federal and state laws, albeit with some notable exceptions. In contrast, we find that current regulation of thought exposing technologies is ambiguous and inadequate to safeguard civil liberties. In order to rectify these shortcomings, we introduce the legal concept of mental trespass and use this concept as the basis for proposed regulation.
AIDec 11, 2020
Fairness in Rating Prediction by Awareness of Verbal and Gesture Quality of Public SpeechesAnkani Chattoraj, Rupam Acharyya, Shouman Das et al.
The role of verbal and non-verbal cues in great public speaking has been a topic of exploration for many decades. We identify a commonality across present theories: the element of "variety or heterogeneity" in channels or modes of communication (e.g., resorting to stories, scientific facts, emotional connections, facial expressions, etc.), which is essential for effectively communicating information. We use this observation to formalize a novel HEterogeneity Metric, HEM, that quantifies the quality of a talk in both the verbal and non-verbal domains (transcript and facial gestures). We use TED talks as an input repository of public speeches because it consists of speakers from a diverse community and has a wide outreach. We show that there is an interesting relationship between HEM and the ratings of TED talks given to speakers by viewers. It emphasizes that HEM inherently and successfully represents the quality of a talk based on "variety or heterogeneity". Further, we also discover that HEM successfully captures the prevalent bias in ratings with respect to race and gender, which we call sensitive attributes (because prediction based on these might result in unfair outcomes). We incorporate the HEM metric into the loss function of a neural network with the goal of reducing unfairness in rating predictions with respect to race and gender. Our results show that the modified loss function improves fairness in prediction without considerably affecting the prediction accuracy of the neural network. Our work ties together a novel metric for public speeches in both the verbal and non-verbal domains with the computational power of a neural network to design a fair prediction system for speakers.
HCDec 9, 2020
Facial expressions can detect Parkinson's disease: preliminary evidence from videos collected onlineMohammad Rafayet Ali, Taylor Myers, Ellen Wagner et al.
One of the symptoms of Parkinson's disease (PD) is hypomimia, or reduced facial expressions. In this paper, we present a digital biomarker for PD that utilizes the study of micro-expressions. We analyzed the facial action units (AU) from 1812 videos of 604 individuals (61 with PD and 543 without PD, mean age 63.9 years, SD 7.8) collected online using a web-based tool (www.parktest.net). In these videos, participants were asked to make three facial expressions (a smiling, disgusted, and surprised face) followed by a neutral face. Using techniques from computer vision and machine learning, we objectively measured the variance of the facial muscle movements and used it to distinguish between individuals with and without PD. The prediction accuracy using the facial micro-expressions was comparable to that of methodologies that utilize motor symptoms. Logistic regression analysis revealed that participants with PD had less variance in AU6 (cheek raiser), AU12 (lip corner puller), and AU4 (brow lowerer) than non-PD individuals. An automated classifier using a Support Vector Machine was trained on the variances and achieved 95.6% accuracy. Using facial expressions as a biomarker for PD could be potentially transformative for patients who need physical separation (e.g., due to COVID) or who are immobile.
HCDec 8, 2020
Technology-driven Alteration of Nonverbal Cues and its Effects on NegotiationRaiyan Abdul Baten, Ehsan Hoque
A person's appearance, identity, and other nonverbal cues can substantially influence how one is perceived by a negotiation counterpart, potentially impacting the outcome of the negotiation. With recent advances in technology, it is now possible to alter such cues through real-time video communication. In many cases, a person's physical presence can explicitly be replaced by 2D/3D representations in live interactive media. In other cases, technologies such as deepfake can subtly and implicitly alter many nonverbal cues -- including a person's appearance and identity -- in real-time. In this article, we look at some state-of-the-art technological advances that can enable such explicit and implicit alteration of nonverbal cues. We also discuss the implications of such technology for the negotiation landscape and highlight ethical considerations that warrant deep, ongoing attention from stakeholders.
HCNov 12, 2020
Immediate or Reflective?: Effects of Real-time Feedback on Group Discussions over VideochatSamiha Samrose, Reza Rawassizadeh, Ehsan Hoque
Having a group discussion with members holding conflicting viewpoints is difficult. It is especially challenging for machine-mediated discussions in which subtle social cues are hard to notice. We present a fully automated videochat framework that can automatically analyze audio-video data of the participants and provide real-time feedback on participation, interruption, volume, and facial emotion. In a heated discourse, these features are especially aligned with the undesired characteristics of dominating the conversation without taking turns, interrupting constantly, raising one's voice, and expressing negative emotion. We conduct a treatment-control user study with 40 participants across 20 sessions in total. We analyze the immediate and the reflective effects of real-time feedback on participants. Our findings show that while real-time feedback can make the ongoing discussion significantly less spontaneous, its effects propagate to successive sessions, bringing significantly more expressiveness to the team. Our explorations of the instant and propagated impacts of real-time feedback can be useful for developing design strategies for various collaborative environments.
CYOct 28, 2020
Detecting Individuals with Depressive Disorder from Personal Google Search and YouTube History LogsBoyu Zhang, Anis Zaman, Rupam Acharyya et al.
Depressive disorder is one of the most prevalent mental illnesses among the global population. However, traditional screening methods require exacting in-person interviews and may fail to provide immediate interventions. In this work, we leverage ubiquitous personal longitudinal Google Search and YouTube engagement logs to detect individuals with depressive disorder. We collected Google Search and YouTube history data and clinical depression evaluation results from 212 participants (99 of whom suffered from moderate to severe depression). We then propose a personalized framework for classifying individuals with and without depression symptoms based on a mutually exciting point process that captures both the temporal and semantic aspects of online activities. Our best model achieved an average F1 score of 0.77 ± 0.04 and an AUC ROC of 0.81 ± 0.02.
HCSep 23, 2020
Novel Computational Linguistic Measures, Dialogue System and the Development of SOPHIE: Standardized Online Patient for Healthcare Interaction EducationMohammad Rafayet Ali, Taylan Sen, Benjamin Kane et al.
In this paper, we describe the iterative participatory design of SOPHIE, an online virtual patient for feedback-based practice of sensitive patient-physician conversations, and discuss an initial qualitative evaluation of the system by professional end users. The design of SOPHIE was motivated by a computational linguistic analysis of the transcripts of 383 patient-physician conversations from an essential office visit of late-stage cancer patients with their oncologists. We developed methods for the automatic detection of two behavioral paradigms, lecturing and positive language usage patterns (sentiment trajectory of conversation), that are shown to be significantly associated with patient prognosis understanding. These automated metrics associated with effective communication were incorporated into SOPHIE, and a pilot user study found that SOPHIE was favorably reviewed by a user group of practicing physicians.
CYSep 5, 2020
The Relationship between Deteriorating Mental Health Conditions and Longitudinal Behavioral Changes in Google and YouTube Usages among College Students in the United States during COVID-19: Observational StudyAnis Zaman, Boyu Zhang, Ehsan Hoque et al.
Mental health problems among the global population have worsened during the coronavirus disease (COVID-19) pandemic. How individuals engage with online platforms such as Google Search and YouTube has undergone drastic shifts due to the pandemic and subsequent lockdowns. Such ubiquitous daily behaviors on online platforms have the potential to capture and correlate with clinically alarming deteriorations in mental health profiles in a non-invasive manner. The goal of this study is to examine, among college students, the relationship between deteriorating mental health conditions and changes in user behaviors when engaging with Google Search and YouTube during COVID-19. This study recruited a cohort of 49 students from a U.S. college campus during January 2020 (prior to the pandemic) and measured the anxiety and depression levels of each participant. This study followed up with the same cohort during May 2020 (during the pandemic), and the anxiety and depression levels were assessed again. The longitudinal Google Search and YouTube history data were anonymized and collected. From individual-level Google Search and YouTube histories, we developed 5 signals that can quantify shifts in online behaviors during the pandemic. We then assessed the differences between groups with and without deteriorating mental health profiles in terms of these features. Significant features included late-night online activities, continuous usage, time away from the internet, porn consumption, and keywords associated with negative emotions, social activities, and personal affairs. Though further studies are required, our results demonstrated the feasibility of utilizing pervasive online data to establish non-invasive surveillance systems for mental health conditions that bypass many disadvantages of existing screening methods.
ASSep 2, 2020
Detecting Parkinson's Disease From an Online Speech-taskWasifur Rahman, Sangwu Lee, Md. Saiful Islam et al.
In this paper, we envision a web-based framework that can help anyone, anywhere around the world, record a short speech task and analyze the recorded data to screen for Parkinson's disease (PD). We collected data from 726 unique participants (262 PD, 38% female; 464 non-PD, 65% female; average age: 61) from all over the US and beyond. A small portion of the data was collected in a lab setting to compare quality. The participants were instructed to utter a popular pangram containing all the letters in the English alphabet: "the quick brown fox jumps over the lazy dog". We extracted both standard acoustic features (Mel Frequency Cepstral Coefficients (MFCC), jitter and shimmer variants) and deep learning based features from the speech data. Using these features, we trained several machine learning algorithms. We achieved 0.75 AUC (Area Under the Curve) performance on determining the presence of self-reported Parkinson's disease by modeling the standard acoustic features with XGBoost, a gradient-boosted decision tree model. Further analysis reveals that the widely used MFCC features and a subset of previously validated dysphonia features designed for detecting Parkinson's from a verbal phonation task (pronouncing 'ahh') contain the most distinct information. Our model performed equally well on data collected in a controlled lab environment and 'in the wild', across different gender and age groups. Using this tool, we can collect data from almost anyone anywhere with a video/audio enabled device, contributing to equity and access in neurological care.
HCJul 1, 2020
Individual-level Anxiety Detection and Prediction from Longitudinal YouTube and Google Search Engagement LogsAnis Zaman, Boyu Zhang, Vincent Silenzio et al.
Anxiety disorder is one of the world's most prevalent mental health conditions, arising from complex interactions of biological and environmental factors and severely interfering with one's ability to lead normal life activities. Current methods for detecting anxiety rely heavily on in-person interviews, which can be expensive, time-consuming, and hindered by social stigma. In this work, we propose an alternative method to identify individuals with anxiety and further estimate their levels of anxiety using personal online activity histories from YouTube and the Google Search engine, platforms that are used by millions of people daily. We ran a longitudinal study and collected multiple rounds of anonymized YouTube and Google Search logs from volunteering participants, along with their clinically validated ground-truth anxiety assessment scores. We then developed explainable features that capture both the temporal and contextual aspects of online behaviors. Using those, we were able to train models that (i) identify individuals having anxiety disorder with an average F1 score of 0.83 and (ii) assess the level of anxiety by predicting the gold standard Generalized Anxiety Disorder 7-item scores (range: 0 to 21) with a mean square error of 1.87 based on the ubiquitous individual-level online engagement data. Our proposed anxiety assessment framework is cost-effective, time-saving, and scalable, and opens the door to deployment in real-world clinical settings, empowering care providers and therapists to learn about the anxiety disorders of patients non-invasively at any moment in time.
LGAug 15, 2019
Integrating Multimodal Information in Large Pretrained TransformersWasifur Rahman, Md. Kamrul Hasan, Sangwu Lee et al.
Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only the language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication). Pre-trained models don't have the necessary components to accept the two extra modalities of vision and acoustics. In this paper, we propose an attachment to BERT and XLNet called Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to the internal representation of BERT and XLNet, a shift that is conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts the sentiment analysis performance over previous baselines as well as language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.