HCNov 7, 2025
Enhancing Public Speaking Skills in Engineering Students Through AIAmol Harsh, Brainerd Prince, Siddharth Siddharth et al.
This research-to-practice full paper was inspired by the persistent challenge in effective communication among engineering students. Public speaking is a necessary skill for future engineers as they have to communicate technical knowledge with diverse stakeholders. While universities offer courses or workshops, they are unable to offer sustained and personalized training to students. Providing comprehensive feedback on both verbal and non-verbal aspects of public speaking is time-intensive, making consistent and individualized assessment impractical. This study integrates research on verbal and non-verbal cues in public speaking to develop an AI-driven assessment model for engineering students. Our approach combines speech analysis, computer vision, and sentiment detection into a multi-modal AI system that provides assessment and feedback. The model evaluates (1) verbal communication (pitch, loudness, pacing, intonation), (2) non-verbal communication (facial expressions, gestures, posture), and (3) expressive coherence, a novel integration ensuring alignment between speech and body language. Unlike previous systems that assess these aspects separately, our model fuses multiple modalities to deliver personalized, scalable feedback. Preliminary testing demonstrated that our AI-generated feedback was moderately aligned with expert evaluations. Among the state-of-the-art AI models evaluated, all of which were Large Language Models (LLMs), including Gemini and OpenAI models, Gemini Pro emerged as the best-performing, showing the strongest agreement with human annotators. By eliminating reliance on human evaluators, this AI-driven public speaking trainer enables repeated practice, helping students naturally align their speech with body language and emotion, crucial for impactful and professional communication.
HCMay 10
AwareLLM: A Proactive Multimodal Ecosystem for Personalized Human-AI Collaboration to Enhance ProductivityAmog Rao, Utkarsh Agarwal, Amol Harsh et al.
Information workers' productivity is significantly influenced by their cognitive states and physiological responses. AI assistants such as ChatGPT, Copilot, and others have become integral components of knowledge-intensive workplaces. These AI assistants utilize pre-defined user preferences and chat interaction histories, thus confining themselves to reactive exchanges, lacking sufficient adaptability. Consequently, they fail to cater to individual user preferences and are unable to adapt to their psychophysiological states, diminishing potential productivity gains. To bridge this gap, we introduce AwareLLM, a novel multimodal framework that integrates egocentric vision, pupillometry, eye-gaze tracking, posture detection, heart activity, and the inferencing capabilities of large language models (LLMs) to create a proactive and context-aware ecosystem. AwareLLM dynamically adapts to users' psychophysiological states while analyzing temporal patterns and behavioral tendencies to provide personalized and timely interventions. We evaluated AwareLLM through a user study with 20 participants, comparing it to a standard LLM assistant across multiple tasks. Our results show statistically significant improvements in task performance, along with reductions in cognitive fatigue and mental demand. Participants described AwareLLM's personalized interventions as timely and relevant, helping them boost their confidence and deepen engagement with their work. AwareLLM opens new avenues for Human-AI collaboration where technology adapts to our needs rather than us adhering to technological constraints.
HCJan 28
GuideAI: A Real-time Personalized Learning Solution with Adaptive InterventionsAnanya Shukla, Chaitanya Modi, Satvik Bajpai et al.
Large Language Models (LLMs) have emerged as powerful learning tools, but they lack awareness of learners' cognitive and physiological states, limiting their adaptability to the user's learning style. Contemporary learning techniques primarily focus on structured learning paths, knowledge tracing, and generic adaptive testing but fail to address real-time learning challenges driven by cognitive load, attention fluctuations, and engagement levels. Building on findings from a formative user study (N=66), we introduce GuideAI, a multi-modal framework that enhances LLM-driven learning by integrating real-time biosensory feedback including eye gaze tracking, heart rate variability, posture detection, and digital note-taking behavior. GuideAI dynamically adapts learning content and pacing through cognitive optimizations (adjusting complexity based on learning progress markers), physiological interventions (breathing guidance and posture correction), and attention-aware strategies (redirecting focus using gaze analysis). Additionally, GuideAI supports diverse learning modalities, including text-based, image-based, audio-based, and video-based instruction, across varied knowledge domains. A preliminary study (N = 25) assessed GuideAI's impact on knowledge retention and cognitive load through standardized assessments. The results show statistically significant improvements in both problem-solving capability and recall-based knowledge assessments. Participants also experienced notable reductions in key NASA-TLX measures including mental demand, frustration levels, and effort, while simultaneously reporting enhanced perceived performance. These findings demonstrate GuideAI's potential to bridge the gap between current LLM-based learning systems and individualized learner needs, paving the way for adaptive, cognition-aware education at scale.
HCApr 1
FlexAI: A Multi-modal Solution for Delivering Personalized and Adaptive Fitness InterventionsShivangi Agarwal, Zoya Ghoshal, Bharat Jain et al.
Personalization of exercise routines is a crucial factor in helping people achieve their fitness goals. Despite this, many contemporary solutions fail to offer real-time, adaptive feedback tailored to an individual's physiological states. Contemporary fitness solutions often rely only on static plans and do not adjust to factors such as a user's pain thresholds, fatigue levels, or form during a workout routine. This work introduces FlexAI, a multi-modal system that integrates computer vision, physiological sensors (heart rate and voice), and the reasoning capabilities of Large Language Models (LLMs) to deliver real-time, personalized workout guidance. FlexAI continuously monitors a user's physical form and level of exertion, among other parameters, to provide dynamic interventions focused on exercise intensity, rest periods, and motivation. To validate our system, we performed a technical evaluation confirming our models' accuracy and quantifying pipeline latency, alongside an expert review where certified trainers validated the correctness of the LLM's interventions. Furthermore, in a controlled study with 25 participants, FlexAI demonstrated significant improvements over a static, non-adaptive control system. With FlexAI, users reported significantly greater enjoyment, a stronger sense of achievement, and significantly lower levels of boredom and frustration. These results indicate that by integrating multi-modal sensing with LLM-driven reasoning, adaptive systems like FlexAI can create a more engaging and effective workout experience. Our work provides a blueprint for integrating multi-modal sensing with LLM-driven reasoning, demonstrating that it is possible to create adaptive coaching systems that are not only more engaging but also demonstrably reliable.
ASJun 2, 2025
Dhvani: A Weakly-supervised Phonemic Error Detection and Personalized Feedback System for HindiArnav Rustagi, Satvik Bajpai, Nimrat Kaur et al.
Computer-Assisted Pronunciation Training (CAPT) has been extensively studied for English. However, there remains a critical gap in its application to Indian languages with a base of 1.5 billion speakers. Pronunciation tools tailored to Indian languages are strikingly lacking despite the fact that millions learn them every year. With over 600 million speakers and being the fourth most-spoken language worldwide, improving Hindi pronunciation is a vital first step toward addressing this gap. This paper proposes 1) Dhvani -- a novel CAPT system for Hindi, 2) synthetic speech generation for Hindi mispronunciations, and 3) a novel methodology for providing personalized feedback to learners. While the system often interacts with learners using Devanagari graphemes, its core analysis targets phonemic distinctions, leveraging Hindi's highly phonetic orthography to analyze mispronounced speech and provide targeted feedback.
HCSep 30, 2019
On Assessing Driver Awareness of Situational Criticalities: Multi-modal Bio-sensing and Vision-based Analysis, Evaluations, and InsightsSiddharth Siddharth, Mohan M. Trivedi
Automobiles for our roadways are increasingly using advanced driver assistance systems. The adoption of such new technologies requires us to develop novel perception systems not only for accurately understanding the situational context of these vehicles, but also to infer the driver's awareness in differentiating between safe and critical situations. This manuscript focuses on the specific problem of inferring driver awareness in the context of attention analysis and hazardous incident activity. Even after the development of wearable and compact multi-modal bio-sensing systems in recent years, their application in driver awareness context has been scarcely explored. The~capability of simultaneously recording different kinds of bio-sensing data in addition to traditionally employed computer vision systems provides exciting opportunities to explore the limitations of these sensor modalities. In this work, we explore the applications of three different bio-sensing modalities namely electroencephalogram (EEG), photoplethysmogram (PPG) and galvanic skin response (GSR) along with a camera-based vision system in driver awareness context. We assess the information from these sensors independently and together using both signal processing- and deep learning-based tools. We show that our methods outperform previously reported studies to classify driver attention and detecting hazardous/non-hazardous situations for short time scales of two seconds. We use EEG and vision data for high resolution temporal classification (two seconds) while additionally also employing PPG and GSR over longer time periods. We evaluate our methods by collecting user data on twelve subjects for two real-world driving datasets among which one is publicly available (KITTI dataset) while the other was collected by us (LISA dataset) with the vehicle ....
LGMay 16, 2019
Utilizing Deep Learning Towards Multi-modal Bio-sensing and Vision-based Affective ComputingSiddharth Siddharth, Tzyy-Ping Jung, Terrence J. Sejnowski
In recent years, the use of bio-sensing signals such as electroencephalogram (EEG), electrocardiogram (ECG), etc. have garnered interest towards applications in affective computing. The parallel trend of deep-learning has led to a huge leap in performance towards solving various vision-based research problems such as object detection. Yet, these advances in deep-learning have not adequately translated into bio-sensing research. This work applies novel deep-learning-based methods to various bio-sensing and video data of four publicly available multi-modal emotion datasets. For each dataset, we first individually evaluate the emotion-classification performance obtained by each modality. We then evaluate the performance obtained by fusing the features from these modalities. We show that our algorithms outperform the results reported by other studies for emotion/valence/arousal/liking classification on DEAP and MAHNOB-HCI datasets and set up benchmarks for the newer AMIGOS and DREAMER datasets. We also evaluate the performance of our algorithms by combining the datasets and by using transfer learning to show that the proposed method overcomes the inconsistencies between the datasets. Hence, we do a thorough analysis of multi-modal affective data from more than 120 subjects and 2,800 trials. Finally, utilizing a convolution-deconvolution network, we propose a new technique towards identifying salient brain regions corresponding to various affective states.
HCApr 25, 2018
Multi-modal Approach for Affective ComputingSiddharth Siddharth, Tzyy-Ping Jung, Terrence J. Sejnowski
Throughout the past decade, many studies have classified human emotions using only a single sensing modality such as face video, electroencephalogram (EEG), electrocardiogram (ECG), galvanic skin response (GSR), etc. The results of these studies are constrained by the limitations of these modalities such as the absence of physiological biomarkers in the face-video analysis, poor spatial resolution in EEG, poor temporal resolution of the GSR etc. Scant research has been conducted to compare the merits of these modalities and understand how to best use them individually and jointly. Using multi-modal AMIGOS dataset, this study compares the performance of human emotion classification using multiple computational approaches applied to face videos and various bio-sensing modalities. Using a novel method for compensating physiological baseline we show an increase in the classification accuracy of various approaches that we use. Finally, we present a multi-modal emotion-classification approach in the domain of affective computing research.