CL Mar 23, 2023
GesGPT: Speech Gesture Synthesis With Text Parsing from ChatGPT
Nan Gao, Zeyu Zhao, Zhi Zeng et al.
Gesture synthesis has gained significant attention as a critical research field, aiming to produce contextually appropriate and natural gestures corresponding to speech or textual input. Although deep learning-based approaches have achieved remarkable progress, they often overlook the rich semantic information present in the text, leading to less expressive and meaningful gestures. In this letter, we propose GesGPT, a novel approach to gesture generation that leverages the semantic analysis capabilities of large language models (LLMs), such as ChatGPT. By capitalizing on the strengths of LLMs for text analysis, we adopt a controlled approach to generate and integrate professional gestures and base gestures through a text parsing script, resulting in diverse and meaningful gestures. First, we develop prompt principles that transform gesture generation into an intention classification problem using ChatGPT. We also analyze emphasis words and semantic words to aid gesture generation. Subsequently, we construct a specialized gesture lexicon with multiple semantic annotations, decoupling the synthesis of gestures into professional gestures and base gestures. Finally, we merge the professional gestures with base gestures. Experimental results demonstrate that GesGPT effectively generates contextually appropriate and expressive gestures.
CL May 16, 2022
A Fast Attention Network for Joint Intent Detection and Slot Filling on Edge Devices
Liang Huang, Senjie Liang, Feiyang Ye et al.
Intent detection and slot filling are two main tasks in natural language understanding and play an essential role in task-oriented dialogue systems. The joint learning of both tasks can improve inference accuracy and is popular in recent works. However, most joint models ignore inference latency and cannot meet the need to deploy dialogue systems at the edge. In this paper, we propose a Fast Attention Network (FAN) for joint intent detection and slot filling tasks, guaranteeing both accuracy and latency. Specifically, we introduce a clean and parameter-refined attention module to enhance the information exchange between intent and slot, improving semantic accuracy by more than 2%. FAN can be implemented on different encoders and delivers more accurate models at every speed level. Our experiments on the Jetson Nano platform show that FAN infers fifteen utterances per second with a small accuracy drop, showing its effectiveness and efficiency on edge devices.
HC Jul 1, 2025
Customer Service Representative's Perception of the AI Assistant in an Organization's Call Center
Kai Qin, Kexin Du, Yimeng Chen et al.
The integration of various AI tools creates a complex socio-technical environment where employee-customer interactions form the core of work practices. This study investigates how customer service representatives (CSRs) at a power grid customer service call center perceive AI assistance in their interactions with customers. Through a field visit and semi-structured interviews with 13 CSRs, we found that AI can alleviate some traditional burdens during the call (e.g., typing and memorizing) but also introduces new burdens (e.g., learning, compliance, psychological burdens). This research contributes to a more nuanced understanding of AI integration in organizational settings and highlights the efforts and burdens undertaken by CSRs to adapt to the updated system.
HC Apr 9
StoryEcho: A Generative Child-as-Actor Storytelling System for Picky-Eating Intervention
Yanuo Zhou, Jun Fang, Yuntao Wang et al.
Picky eating in children can undermine dietary diversity and the development of healthy eating habits, while also creating recurring tension in family feeding routines. Prior interventions have explored food-centered designs, enhanced utensils, and mealtime interactive systems, but few position children as active participants in intervention processes that extend beyond single mealtime interactions. To better understand everyday responses to picky eating and child-acceptable intervention mechanisms, we conducted a formative study with caregivers and kindergarten teachers. Based on the resulting design considerations and iterative stakeholder review, we designed StoryEcho, a generative child-as-actor storytelling system for picky eating intervention. StoryEcho engages children outside mealtimes through personalized stories in which the child appears as a persistent story character and later shapes story development through real-world food-related behavior. The system combines non-mealtime story engagement, lightweight post-meal feedback, and behavior-informed story updates to support repeated intervention across everyday family routines. We evaluated StoryEcho in a between-group field study with 11 families of preschool children. Results provide preliminary evidence that StoryEcho can significantly increase children's willingness to approach and try target low-preference foods while reducing parental pressure around feeding. These findings suggest the promise of generative child-as-actor storytelling as a design approach for home-based behavior support that unfolds through recurring family routines.
CL Mar 26, 2025
SARGes: Semantically Aligned Reliable Gesture Generation via Intent Chain
Nan Gao, Yihua Bao, Dongdong Weng et al.
Co-speech gesture generation enhances human-computer interaction realism through speech-synchronized gesture synthesis. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to parse speech content and generate reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures. First, we constructed a comprehensive co-speech gesture ethogram and developed an LLM-based intent chain reasoning mechanism that systematically parses and decomposes gesture semantics into structured inference steps following ethogram criteria, effectively guiding LLMs to generate context-aware gesture labels. Subsequently, we constructed an intent chain-annotated text-to-gesture label dataset and trained a lightweight gesture label generation model, which then guides the generation of credible and semantically coherent co-speech gestures. Experimental results demonstrate that SARGes achieves highly semantically-aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). The proposed method provides an interpretable intent reasoning pathway for semantic gesture synthesis.
CV Jan 26, 2025
InfoBFR: Real-World Blind Face Restoration via Information Bottleneck
Nan Gao, Jia Li, Huaibo Huang et al.
Blind face restoration (BFR) is a highly challenging problem due to the uncertainty of data degradation patterns. Current BFR methods have realized certain restored productions but with inherent neural degradations that limit real-world generalization in complicated scenarios. In this paper, we propose a plug-and-play framework InfoBFR to tackle neural degradations, e.g., prior bias, topological distortion, textural distortion, and artifact residues, which achieves high-generalization face restoration in diverse wild and heterogeneous scenes. Specifically, based on the results from pre-trained BFR models, InfoBFR considers information compression using manifold information bottleneck (MIB) and information compensation with efficient diffusion LoRA to conduct information optimization. InfoBFR effectively synthesizes high-fidelity faces without attribute and identity distortions. Comprehensive experimental results demonstrate the superiority of InfoBFR over state-of-the-art GAN-based and diffusion-based BFR methods, with around 70ms consumption, 16M trainable parameters, and nearly 85% BFR-boosting. It is promising that InfoBFR will be the first plug-and-play restorer universally employed by diverse BFR models to conquer neural degradations.
CV Mar 15, 2024
DiffMAC: Diffusion Manifold Hallucination Correction for High Generalization Blind Face Restoration
Nan Gao, Jia Li, Huaibo Huang et al.
Blind face restoration (BFR) is a highly challenging problem due to the uncertainty of degradation patterns. Current methods have low generalization across photorealistic and heterogeneous domains. In this paper, we propose a Diffusion-Information-Diffusion (DID) framework to tackle diffusion manifold hallucination correction (DiffMAC), which achieves high-generalization face restoration in diverse degraded scenes and heterogeneous domains. Specifically, the first diffusion stage aligns the restored face with spatial feature embedding of the low-quality face based on AdaIN, which synthesizes degradation-removal results but with uncontrollable artifacts for some hard cases. Based on Stage I, Stage II considers information compression using manifold information bottleneck (MIB) and finetunes the first diffusion model to improve facial fidelity. DiffMAC effectively fights against blind degradation patterns and synthesizes high-quality faces with attribute and identity consistencies. Experimental results demonstrate the superiority of DiffMAC over state-of-the-art methods, with a high degree of generalization in real-world and heterogeneous settings. The source code and models will be public.
AI Aug 13, 2025
An Automated Multi-modal Evaluation Framework for Mobile Intelligent Assistants Based on Large Language Models and Multi-Agent Collaboration
Meiping Wang, Jian Zhong, Rongduo Han et al.
With the rapid development of mobile intelligent assistant technologies, multi-modal AI assistants have become essential interfaces for daily user interactions. However, current evaluation methods face challenges including high manual costs, inconsistent standards, and subjective bias. This paper proposes an automated multi-modal evaluation framework based on large language models and multi-agent collaboration. The framework employs a three-tier agent architecture consisting of interaction evaluation agents, semantic verification agents, and experience decision agents. Through supervised fine-tuning on the Qwen3-8B model, we achieve a significant evaluation matching accuracy with human experts. Experimental results on eight major intelligent agents demonstrate the framework's effectiveness in predicting users' satisfaction and identifying generation defects.
CV May 18, 2025
NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation
Jia Li, Nan Gao, Huaibo Huang et al.
The diffusion model has provided a strong tool for implementing text-to-image (T2I) and image-to-image (I2I) generation. Recently, topology and texture control are popular explorations, e.g., ControlNet, IP-Adapter, Ctrl-X, and DSG. These methods explicitly consider high-fidelity controllable editing based on external signals or diffusion feature manipulations. As for diversity, they directly choose different noise latents. However, the diffused noise is capable of implicitly representing the topological and textural manifold of the corresponding image. Moreover, it's an effective workbench to conduct the trade-off between content preservation and controllable variations. Previous T2I and I2I diffusion works do not explore the information within the compressed contextual latent. In this paper, we first propose a plug-and-play noise finetune NOFT module employed by Stable Diffusion to generate highly correlated and diverse images. We fine-tune seed noise or inverse noise through an optimal-transported (OT) information bottleneck (IB) with around only 14K trainable parameters and 10 minutes of training. Our test-time NOFT is good at producing high-fidelity image variations considering topology and texture alignments. Comprehensive experiments demonstrate that NOFT is a powerful general reimagine approach to efficiently fine-tune the 2D/3D AIGC assets with text or image guidance.
HC Dec 23, 2021
Individual and Group-wise Classroom Seating Experience: Effects on Student Engagement in Different Courses
Nan Gao, Mohammad Saiedur Rahaman, Wei Shao et al.
Seating location in the classroom can affect student engagement, attention and academic performance by providing better visibility, improved movement, and participation in discussions. Existing studies typically explore how traditional seating arrangements (e.g. grouped tables or traditional rows) influence students' perceived engagement, without considering group seating behaviours under more flexible seating arrangements. Furthermore, survey-based measures of student engagement are prone to subjectivity and various response biases. Therefore, in this research, we investigate how individual and group-wise classroom seating experiences affect student engagement using wearable physiological sensors. We conducted a field study at a high school and collected survey and wearable data from 23 students in 10 courses over four weeks. We aim to answer the following research questions: 1. How does the seating proximity between students relate to their perceived learning engagement? 2. How do students' group seating behaviours relate to their physiologically-based measures of engagement (i.e. physiological arousal and physiological synchrony)? Experimental results indicate that the individual and group-wise classroom seating experience is associated with perceived student engagement and physiologically-based engagement measured from electrodermal activity. We also find that students who sit close together are more likely to have similar learning engagement and tend to have high physiological synchrony. This research opens up opportunities to explore the implications of flexible seating arrangements and has great potential to maximize student engagement by suggesting intelligent seating choices in the future.
HC Jul 1, 2021
Investigating the Reliability of Self-report Data in the Wild: The Quest for Ground Truth
Nan Gao, Mohammad Saiedur Rahaman, Wei Shao et al.
Inferring human mental state (e.g., emotion, depression, engagement) with sensing technology is one of the most valuable challenges in the affective computing area, which has a profound impact in all industries interacting with humans. The self-report survey is the most common way to quantify how people think, but it is prone to subjectivity and various response biases. It is usually used as the ground truth for human mental state prediction. In recent years, many data-driven machine learning models have been built with self-report annotations as the target value. In this research, we investigate the reliability of self-report surveys in the wild by studying the confidence level of responses and survey completion time. We conduct a case study (i.e., student engagement inference) by recruiting 23 students in a high school setting over a period of 4 weeks. Our participants volunteered 488 self-reported responses and data from their wearable sensors. We also find that the physiologically measured student engagement and perceived student engagement are not always consistent. The findings from this research have great potential to benefit future studies in predicting engagement, depression, stress, and other emotion-related states in the field of affective computing and sensing technologies.
HC May 14, 2021
Understanding occupants' behaviour, engagement, emotion, and comfort indoors with heterogeneous sensors and wearables
Nan Gao, Max Marschall, Jane Burry et al.
We conducted a field study at a K-12 private school in the suburbs of Melbourne, Australia. The data capture contained two elements: First, a 5-month longitudinal field study In-Gauge using two outdoor weather stations, as well as indoor weather stations in 17 classrooms and temperature sensors on the vents of occupant-controlled room air-conditioners; these were collated into individual datasets for each classroom at a 5-minute logging frequency, including additional data on occupant presence. The dataset was used to derive predictive models of how occupants operate room air-conditioning units. Second, we tracked 23 students and 6 teachers in a 4-week cross-sectional study En-Gage, using wearable sensors to log physiological data, as well as daily surveys to query the occupants' thermal comfort, learning engagement, emotions and seating behaviours. Overall, the combined dataset could be used to analyse the relationships between indoor/outdoor climates and students' behaviours/mental states on campus, which provide opportunities for the future design of intelligent feedback systems to benefit both students and staff.
LG Aug 18, 2020
Generative Adversarial Networks for Spatio-temporal Data: A Survey
Nan Gao, Hao Xue, Wei Shao et al.
Generative Adversarial Networks (GANs) have shown remarkable success in producing realistic-looking images in the computer vision area. Recently, GAN-based techniques have been shown to be promising for spatio-temporal applications such as trajectory prediction, event generation and time-series data imputation. While several reviews of GANs in computer vision have been presented, none has addressed the practical applications and challenges relevant to spatio-temporal data. In this paper, we conduct a comprehensive review of the recent developments of GANs for spatio-temporal data. We summarise the application of popular GAN architectures for spatio-temporal data and the common practices for evaluating the performance of spatio-temporal applications with GANs. Finally, we point out future research directions to benefit researchers in this area.
HC Jul 9, 2020
n-Gage: Predicting in-class Emotional, Behavioural and Cognitive Engagement in the Wild
Nan Gao, Wei Shao, Mohammad Saiedur Rahaman et al.
The study of student engagement has attracted growing interest as a way to address problems such as low academic performance, disaffection, and high dropout rates. Existing approaches to measuring student engagement typically rely on survey-based instruments. While effective, those approaches are time-consuming and labour-intensive. Meanwhile, both the response rate and quality of the survey are usually poor. As an alternative, in this paper, we investigate whether we can infer and predict engagement at multiple dimensions, just using sensors. We hypothesize that multidimensional student engagement can be translated into physiological responses and activity changes during the class, and can also be affected by environmental changes. Therefore, we aim to explore the following questions: Can we measure the multiple dimensions of high school students' learning engagement, including emotional, behavioural and cognitive engagement, with sensing data in the wild? Can we derive the activity, physiological, and environmental factors contributing to the different dimensions of student engagement? If yes, which sensors are the most useful in differentiating each dimension of engagement? Then, we conduct an in-situ study in a high school with 23 students and 6 teachers in 144 classes over 11 courses for 4 weeks. We present n-Gage, a student engagement sensing system using a combination of sensors from wearables and environments to automatically detect student in-class multidimensional learning engagement. Experimental results show that n-Gage can accurately predict multidimensional student engagement in real-world scenarios with an average MAE of 0.788 and RMSE of 0.975 using all the sensors. We also show a set of interesting findings on how different factors (e.g., combinations of sensors, school subjects, CO2 level) affect each dimension of student learning engagement.
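For readers unfamiliar with the two error metrics quoted in the n-Gage abstract, the following is a minimal sketch of how mean absolute error (MAE) and root mean squared error (RMSE) are conventionally computed. The function names and the sample scores are illustrative assumptions; they are not taken from the paper or its data.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: the average of |true - predicted|."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared difference."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical engagement scores (made up for illustration only).
reported = [1, 2, 3]
predicted = [2, 2, 5]
print(mae(reported, predicted))
print(rmse(reported, predicted))
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data and penalises large individual errors more heavily.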
CV May 28, 2020
Overview: Computer vision and machine learning for microstructural characterization and analysis
Elizabeth A. Holm, Ryan Cohn, Nan Gao et al.
The characterization and analysis of microstructure is the foundation of microstructural science, connecting the materials structure to its composition, process history, and properties. Microstructural quantification traditionally involves a human deciding a priori what to measure and then devising a purpose-built method for doing so. However, recent advances in data science, including computer vision (CV) and machine learning (ML) offer new approaches to extracting information from microstructural images. This overview surveys CV approaches to numerically encode the visual information contained in a microstructural image, which then provides input to supervised or unsupervised ML algorithms that find associations and trends in the high-dimensional image representation. CV/ML systems for microstructural characterization and analysis span the taxonomy of image analysis tasks, including image classification, semantic segmentation, object detection, and instance segmentation. These tools enable new approaches to microstructural analysis, including the development of new, rich visual metrics and the discovery of processing-microstructure-property relationships.
LG Apr 29, 2020
Transfer Learning for Thermal Comfort Prediction in Multiple Cities
Nan Gao, Wei Shao, Mohammad Saiedur Rahaman et al.
The HVAC (Heating, Ventilation and Air Conditioning) system is an important part of a building and accounts for up to 40% of building energy usage. The main purpose of HVAC, maintaining appropriate thermal comfort, is crucial for the best utilisation of energy. Thermal comfort is also crucial for well-being, health, and work productivity. Recently, data-driven thermal comfort models have outperformed traditional knowledge-based methods (e.g. the Predicted Mean Vote model). An accurate thermal comfort model requires a large amount of self-reported thermal comfort data from indoor occupants, which remains a challenge for researchers. In this research, we aim to tackle this data-shortage problem and boost the performance of thermal comfort prediction. We utilise sensor data from multiple cities in the same climate zone to learn thermal comfort patterns. We present a transfer learning based multilayer perceptron model from the same climate zone (TL-MLP-C*) for accurate thermal comfort prediction. Extensive experimental results on the ASHRAE RP-884, Scales Project and Medium US Office datasets show that the proposed TL-MLP-C* exceeds the state-of-the-art methods in accuracy, precision and F1-score.
SP Dec 6, 2019
Data Augmentation for Deep Learning-based Radio Modulation Classification
Liang Huang, Weijian Pan, You Zhang et al.
Deep learning has recently been applied to automatically classify the modulation categories of received radio signals without manual experience. However, training deep learning models requires a massive volume of data. Insufficient training data causes serious overfitting and degrades the classification accuracy. To cope with small datasets, data augmentation has been widely used in image processing to expand the dataset and improve the robustness of deep learning models. However, in wireless communication areas, the effect of different data augmentation methods on radio modulation classification has not been studied yet. In this paper, we evaluate different data augmentation methods via a state-of-the-art deep learning-based modulation classifier. Based on the characteristics of modulated signals, three augmentation methods are considered, i.e., rotation, flip, and Gaussian noise, which can be applied in both the training phase and the inference phase of the deep learning algorithm. Numerical results show that all three augmentation methods can improve the classification accuracy. Among them, the rotation augmentation method outperforms the flip method, and both achieve higher classification accuracy than the Gaussian noise method. Given only 12.5% of the training dataset, a joint rotation and flip augmentation policy can achieve even higher classification accuracy than the baseline trained on the full 100% training dataset without augmentation. Furthermore, with data augmentation, radio modulation categories can be successfully classified using shorter radio samples, leading to a simplified deep learning model and a shorter classification response time.
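The three augmentations named in the abstract above (rotation, flip, Gaussian noise) can be sketched on complex I/Q samples as follows. This is an illustrative sketch only: the function names, the representation of a signal as a list of Python complex numbers, the choice of rotation angle, and the reading of "flip" as negating the in-phase or quadrature component are assumptions, not the authors' implementation.

```python
import cmath
import random


def rotate(samples, angle):
    """Rotate each I/Q sample by `angle` radians around the origin."""
    phase = cmath.exp(1j * angle)
    return [s * phase for s in samples]


def flip(samples, axis="horizontal"):
    """Mirror the constellation (assumed meaning of 'flip').

    horizontal: negate the in-phase (real) component.
    vertical:   negate the quadrature (imaginary) component.
    """
    if axis == "horizontal":
        return [complex(-s.real, s.imag) for s in samples]
    if axis == "vertical":
        return [complex(s.real, -s.imag) for s in samples]
    raise ValueError("axis must be 'horizontal' or 'vertical'")


def add_gaussian_noise(samples, sigma, rng=None):
    """Add independent zero-mean Gaussian noise to I and Q."""
    rng = rng or random.Random()
    return [
        complex(s.real + rng.gauss(0.0, sigma), s.imag + rng.gauss(0.0, sigma))
        for s in samples
    ]


# Hypothetical QPSK-like samples, expanded into several augmented copies.
signal = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
augmented = [rotate(signal, k * cmath.pi / 2) for k in range(4)]
augmented.append(flip(signal, "horizontal"))
augmented.append(add_gaussian_noise(signal, sigma=0.05, rng=random.Random(0)))
print(len(augmented))
```

Each augmented copy keeps the original modulation label, so the labelled training set grows without collecting new signals.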
DC Sep 22, 2019
Cutting the Unnecessary Long Tail: Cost-Effective Big Data Clustering in the Cloud
Dongwei Li, Shuliang Wang, Nan Gao et al.
Clustering big data often requires tremendous computational resources where cloud computing is undoubtedly one of the promising solutions. However, the computation cost in the cloud can be unexpectedly high if it cannot be managed properly. The long tail phenomenon has been observed widely in the big data clustering area, which indicates that the majority of time is often consumed in the middle to late stages in the clustering process. In this research, we try to cut the unnecessary long tail in the clustering process to achieve a sufficiently satisfactory accuracy at the lowest possible computation cost. A novel approach is proposed to achieve cost-effective big data clustering in the cloud. By training the regression model with the sampling data, we can make widely used k-means and EM (Expectation-Maximization) algorithms stop automatically at an early point when the desired accuracy is obtained. Experiments are conducted on four popular data sets and the results demonstrate that both k-means and EM algorithms can achieve high cost-effectiveness in the cloud with our proposed approach. For example, in the case studies with the much more efficient k-means algorithm, we find that achieving a 99% accuracy needs only 47.71%-71.14% of the computation cost required for achieving a 100% accuracy while the less efficient EM algorithm needs 16.69%-32.04% of the computation cost. To put that into perspective, in the United States land use classification example, our approach can save up to $94,687.49 for the government in each use.
HC Jun 19, 2019
Predicting Personality Traits from Physical Activity Intensity
Nan Gao, Wei Shao, Flora D Salim
Call and messaging logs from mobile devices have been used to predict human personality traits successfully in recent years. However, widely available accelerometer data has not yet been utilized for this purpose. In this research, we explore important features describing human physical activity intensity, used for the first time to predict human personality traits from raw accelerometer data. Using a set of newly introduced metrics, we combine physical activity intensity features with traditional phone activity features for personality prediction. The experimental results show that the predicted personality scores are closer to the ground truth, with an observable reduction of errors in predicting the Big-5 personality traits for both male and female participants.