CV · Oct 3, 2022 · Code
rPPG-Toolbox: Deep Remote PPG Toolbox
Xin Liu, Girish Narayanswamy, Akshay Paruchuri et al. · stanford, tsinghua
Camera-based physiological measurement is a fast-growing field of computer vision. Remote photoplethysmography (rPPG) utilizes imaging devices (e.g., cameras) to measure the peripheral blood volume pulse (BVP) via photoplethysmography, and enables cardiac measurement via webcams and smartphones. However, the task is non-trivial, with important pre-processing, modeling, and post-processing steps required to obtain state-of-the-art results. Replication of results and benchmarking of new models are critical for scientific progress; however, as with many other applications of deep learning, reliable codebases are not easy to find or use. We present a comprehensive toolbox, rPPG-Toolbox, that contains unsupervised and supervised rPPG models with support for public benchmark datasets, data augmentation, and systematic evaluation: https://github.com/ubicomplab/rPPG-Toolbox
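The post-processing step mentioned above often includes turning a recovered BVP waveform into a heart-rate value. The following is a minimal sketch of one common way to do that, by picking the strongest spectral peak inside a plausible heart-rate band. The function name `estimate_heart_rate_bpm`, the 0.75 to 2.5 Hz band, and the synthetic signal are illustrative assumptions, not the toolbox's actual API or settings.

```python
import cmath
import math


def estimate_heart_rate_bpm(bvp, fs, low_hz=0.75, high_hz=2.5):
    """Return the dominant frequency of `bvp` within [low_hz, high_hz], in BPM.

    The band limits (45 to 150 BPM) are an assumed default, not a value
    taken from rPPG-Toolbox.
    """
    n = len(bvp)
    mean = sum(bvp) / n
    centered = [v - mean for v in bvp]  # remove the DC offset
    best_freq, best_power = 0.0, -1.0
    # Evaluate only the DFT bins whose frequency lies inside the band.
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if freq < low_hz or freq > high_hz:
            continue
        coeff = sum(
            c * cmath.exp(-2j * math.pi * k * i / n)
            for i, c in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq * 60.0


# Synthetic 10 s pulse at 1.2 Hz (72 BPM), sampled at 30 frames per second.
fs = 30.0
signal = [math.sin(2 * math.pi * 1.2 * i / fs) for i in range(300)]
bpm = estimate_heart_rate_bpm(signal, fs)
```

A real pipeline would typically detrend and band-pass filter the signal first; the direct DFT here is written out only to keep the sketch dependency-free.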
CV · Feb 8, 2023 · Code
MMPD: Multi-Domain Mobile Video Physiology Dataset
Jiankai Tang, Kequan Chen, Yuntao Wang et al. · tsinghua
Remote photoplethysmography (rPPG) is an attractive method for noninvasive, convenient and concomitant measurement of physiological vital signals. Public benchmark datasets have served a valuable role in the development of this technology and improvements in accuracy over recent years. However, there remain gaps in the public datasets. First, despite the ubiquity of cameras on mobile devices, there are few datasets recorded specifically with mobile phone cameras. Second, most datasets are relatively small and therefore are limited in diversity in appearance (e.g., skin tone), behaviors (e.g., motion), and environment (e.g., lighting conditions). In an effort to help the field advance, we present the Multi-domain Mobile Video Physiology Dataset (MMPD), comprising 11 hours of recordings from mobile phones of 33 subjects. The dataset is designed to capture videos with greater representation across skin tone, body motion, and lighting conditions. MMPD is comprehensive with eight descriptive labels and can be used in conjunction with the rPPG-toolbox. The reliability of the dataset is verified by mainstream unsupervised methods and neural methods. The GitHub repository of our dataset: https://github.com/THU-CS-PI/MMPD_rPPG_dataset.
AI · Sep 22, 2024
Large Model Based Agents: State-of-the-Art, Cooperation Paradigms, Security and Privacy, and Future Trends
Yuntao Wang, Yanghe Pan, Zhou Su et al.
With the rapid advancement of large models (LMs), the development of general-purpose intelligent agents powered by LMs has become a reality. It is foreseeable that in the near future, LM-driven general AI agents will serve as essential tools in production tasks, capable of autonomous communication and collaboration without human intervention. This paper investigates scenarios involving the autonomous collaboration of future LM agents. We review the current state of LM agents, the key technologies enabling LM agent collaboration, and the security and privacy challenges they face during cooperative operations. To this end, we first explore the foundational principles of LM agents, including their general architecture, key components, enabling technologies, and modern applications. We then discuss practical collaboration paradigms from data, computation, and knowledge perspectives to achieve connected intelligence among LM agents. After that, we analyze the security vulnerabilities and privacy risks associated with LM agents, particularly in multi-agent settings, examining underlying mechanisms and reviewing current and potential countermeasures. Lastly, we propose future research directions for building robust and secure LM agent ecosystems.
79.2 · AI · May 25 · Code
Security of OpenClaw Agents: Fundamentals, Attacks, and Countermeasures
Yuntao Wang, Jianle Ba, Han Liu et al.
The rapid evolution of large language model (LLM)-driven autonomous agents has given rise to OpenClaw, a new class of open-source agent frameworks that operate as continuously running, skill-augmented systems with persistent memory, multi-channel interaction, and high degrees of autonomy. Such capabilities enable OpenClaw agents to autonomously execute complex, multi-step tasks and interact seamlessly with external applications, but simultaneously introduce a substantially enlarged attack surface. In particular, the combination of high-privilege operations and persistent memory exposes OpenClaw agents to various emerging threats, including skill poisoning, cognitive manipulation, multi-agent cascading failures, and supply-chain vulnerabilities. In this survey, we present a comprehensive study of the security landscape of OpenClaw agents. We first examine the general architecture and key characteristics that distinguish OpenClaw agents from traditional AI agent systems. We categorize existing security and privacy threats into a layered framework and analyze how vulnerabilities arise during agent reasoning, action execution, and external interaction. Representative defense mechanisms are also reviewed to draw the current defense landscape. Finally, several unresolved issues related to the reliability and trustworthiness of OpenClaw ecosystems are discussed.
CV · Sep 28, 2024
Summit Vitals: Multi-Camera and Multi-Signal Biosensing at High Altitudes
Ke Liu, Jiankai Tang, Zhang Jiang et al. · tsinghua
Video photoplethysmography (vPPG) is an emerging method for non-invasive and convenient measurement of physiological signals, utilizing two primary approaches: remote video PPG (rPPG) and contact video PPG (cPPG). Monitoring vitals in high-altitude environments, where heart rates tend to increase and blood oxygen levels often decrease, presents significant challenges. To address these issues, we introduce the SUMS dataset comprising 80 synchronized non-contact facial and contact finger videos from 10 subjects during exercise and oxygen recovery scenarios, capturing PPG, respiration rate (RR), and SpO2. This dataset is designed to validate video vitals estimation algorithms and compare facial rPPG with finger cPPG. Additionally, fusing videos from different positions (i.e., face and finger) reduces the mean absolute error (MAE) of SpO2 predictions by 7.6% and 10.6% compared to only face and only finger, respectively. In cross-subject evaluation, we achieve an MAE of less than 0.5 BPM for HR estimation and 2.5% for SpO2 estimation, demonstrating the precision of our multi-camera fusion techniques. Our findings suggest that simultaneous training on multiple indicators, such as PPG and blood oxygen, can reduce MAE in SpO2 estimation by 17.8%.
CR · Dec 25, 2022
Social-Aware Clustered Federated Learning with Customized Privacy Preservation
Yuntao Wang, Zhou Su, Yanghe Pan et al.
A key feature of federated learning (FL) is to preserve the data privacy of end users. However, there still exists potential privacy leakage in exchanging gradients under FL. As a result, recent research often explores differential privacy (DP) approaches that add noise to the computed results to address privacy concerns with low overhead, which, however, degrades model performance. In this paper, we strike a balance between data privacy and efficiency by utilizing the pervasive social connections between users. Specifically, we propose SCFL, a novel Social-aware Clustered Federated Learning scheme, where mutually trusted individuals can freely form a social cluster and aggregate their raw model updates (e.g., gradients) inside each cluster before uploading to the cloud for global aggregation. By mixing model updates in a social group, adversaries can only eavesdrop on the social-layer combined results, but not the privacy of individuals. We unfold the design of SCFL in three steps. i) Stable social cluster formation. Considering users' heterogeneous training samples and data distributions, we formulate the optimal social cluster formation problem as a federation game and devise a fair revenue allocation mechanism to resist free-riders. ii) Differentiated trust-privacy mapping. For the clusters with low mutual trust, we design a customizable privacy preservation mechanism to adaptively sanitize participants' model updates depending on social trust degrees. iii) Distributed convergence. A distributed two-sided matching algorithm is devised to attain an optimized disjoint partition with Nash-stable convergence. Experiments on the Facebook network and MNIST/CIFAR-10 datasets validate that our SCFL can effectively enhance learning utility, improve user payoff, and enforce customizable privacy protection.
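The two-level aggregation idea described above (mix updates inside a trusted cluster, then upload only the combined result) can be sketched as follows. This is an illustration under stated assumptions only: the helper names `mean_update` and `cluster_then_global` are hypothetical, size-weighted averaging is assumed as the combination rule, and the paper's cluster formation game, revenue allocation, and trust-dependent sanitization are not modeled.

```python
def mean_update(updates):
    """Element-wise mean of equally sized update vectors."""
    count = len(updates)
    return [sum(values) / count for values in zip(*updates)]


def cluster_then_global(clusters):
    """Average inside each cluster, then combine the cluster means.

    Cluster means are weighted by cluster size (an assumed rule), so the
    result equals the plain mean over all users, while the server only
    ever sees one combined vector per cluster.
    """
    cluster_means = [(mean_update(members), len(members)) for members in clusters]
    total = sum(size for _, size in cluster_means)
    dim = len(cluster_means[0][0])
    global_update = [
        sum(mean[d] * size for mean, size in cluster_means) / total
        for d in range(dim)
    ]
    return [mean for mean, _ in cluster_means], global_update


clusters = [
    [[1.0, 2.0], [3.0, 4.0]],               # cluster A: two mutually trusting users
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],  # cluster B: three users
]
cluster_means, global_update = cluster_then_global(clusters)
```

With these toy inputs the uploaded cluster means are `[2.0, 3.0]` and `[7.0, 8.0]`, and the global update is `[5.0, 6.0]`, the same value a flat average over all five users would give.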
GR · Aug 29, 2024 · Code
GSDiff: Synthesizing Vector Floorplans via Geometry-enhanced Structural Graph Generation
Sizhe Hu, Wenming Wu, Yuntao Wang et al.
Automating architectural floorplan design is vital for housing and interior design, offering a faster, cost-effective alternative to manual sketches by architects. However, existing methods, including rule-based and learning-based approaches, face challenges in design complexity and constrained generation with extensive post-processing, and tend to produce obvious geometric inconsistencies such as misalignment, overlap, and gaps. In this work, we propose a novel generative framework for vector floorplan design via structural graph generation, called GSDiff, focusing on wall junction generation and wall segment prediction to capture both geometric and semantic aspects of structural graphs. To improve the geometric rationality of generated structural graphs, we propose two innovative geometry enhancement methods. In wall junction generation, we propose a novel alignment loss function to improve geometric consistency. In wall segment prediction, we propose a random self-supervision method to enhance the model's perception of the overall geometric structure, thereby promoting the generation of reasonable geometric structures. Employing the diffusion model and the Transformer model, as well as the geometry enhancement strategies, our framework can generate wall junctions, wall segments and room polygons with structural and semantic information, resulting in structural graphs that accurately represent floorplans. Extensive experiments show that the proposed method surpasses existing techniques, enabling free generation and constrained generation, marking a shift towards structure generation in architectural design. Code and data are available at https://github.com/SizheHu/GSDiff.
HC · Mar 18, 2023
Modeling the Trade-off of Privacy Preservation and Activity Recognition on Low-Resolution Images
Yuntao Wang, Zirui Cheng, Xin Yi et al.
A computer vision system using low-resolution image sensors can provide intelligent services (e.g., activity recognition) while withholding unnecessary privacy-sensitive visual information at the hardware level. However, preserving visual privacy and enabling accurate machine recognition place conflicting demands on image resolution. Modeling the trade-off of privacy preservation and machine recognition performance can guide future privacy-preserving computer vision systems using low-resolution image sensors. In this paper, using at-home activities of daily living (ADLs) as the scenario, we first obtained the most important visual privacy features through a user survey. Then we quantified and analyzed the effects of image resolution on human and machine recognition performance in activity recognition and privacy awareness tasks. We also investigated how modern image super-resolution techniques influence these effects. Based on the results, we proposed a method for modeling the trade-off of privacy preservation and activity recognition on low-resolution images.
CV · Nov 4, 2025 · Code
M3PD Dataset: Dual-view Photoplethysmography (PPG) Using Front-and-rear Cameras of Smartphones in Lab and Clinical Settings
Jiankai Tang, Tao Zhang, Jia Li et al.
Portable physiological monitoring is essential for early detection and management of cardiovascular disease, but current methods often require specialized equipment that limits accessibility or impose impractical postures that patients cannot maintain. Video-based photoplethysmography on smartphones offers a convenient noninvasive alternative, yet it still faces reliability challenges caused by motion artifacts, lighting variations, and single-view constraints. Few studies have demonstrated reliable application to cardiovascular patients, and no widely used open datasets exist for cross-device accuracy. To address these limitations, we introduce the M3PD dataset, the first publicly available dual-view mobile photoplethysmography dataset, comprising synchronized facial and fingertip videos captured simultaneously via front and rear smartphone cameras from 60 participants (including 47 cardiovascular patients). Building on this dual-view setting, we further propose F3Mamba, which fuses the facial and fingertip views through Mamba-based temporal modeling. The model reduces heart-rate error by 21.9 to 30.2 percent over existing single-view baselines while improving robustness in challenging real-world scenarios. Data and code: https://github.com/Health-HCI-Group/F3Mamba.
LG · Nov 21, 2023
ALPHA: AnomaLous Physiological Health Assessment Using Large Language Models
Jiankai Tang, Kegang Wang, Hongming Hu et al. · tsinghua
This study concentrates on evaluating the efficacy of Large Language Models (LLMs) in healthcare, with a specific focus on their application in personal anomalous health monitoring. Our research primarily investigates the capabilities of LLMs in interpreting and analyzing physiological data obtained from FDA-approved devices. We conducted an extensive analysis using anomalous physiological data gathered in a simulated low-air-pressure plateau environment. This allowed us to assess the precision and reliability of LLMs in understanding and evaluating users' health status with notable specificity. Our findings reveal that LLMs exhibit exceptional performance in determining medical indicators, including a Mean Absolute Error (MAE) of less than 1 beat per minute for heart rate and less than 1% for oxygen saturation (SpO2). Furthermore, the Mean Absolute Percentage Error (MAPE) for these evaluations remained below 1%, with the overall accuracy of health assessments surpassing 85%. In image analysis tasks, such as interpreting photoplethysmography (PPG) data, our specially adapted GPT models demonstrated remarkable proficiency, achieving less than 1 bpm error in cycle count and 7.28 MAE for heart rate estimation. This study highlights LLMs' dual role as health data analysis tools and pivotal elements in advanced AI health assistants, offering personalized health insights and recommendations within the future health assistant framework.
68.8 · CR · May 26
Secure UAV Swarms in Low-Altitude Wireless Networks: Challenges and Solutions
Yuntao Wang, Haojia Yang, Han Liu et al.
Unmanned aerial vehicle (UAV) swarms are increasingly deployed in vast low-altitude applications, owing to their capabilities in distributed sensing, flexible communication, and autonomous coordination. Nevertheless, the open and highly dynamic operating environment of UAV swarms introduces serious security risks, including GPS spoofing, insider threats, and multi-hop intrusion. These threats are aggravated by limited on-board resources, frequently changing network topology, and the presence of intelligent adversaries. To tackle these issues, this paper proposes a cloud-edge-end collaborative defense framework for UAV swarms. Based on this framework, three complementary mechanisms are developed. First, a cooperative perception scheme is designed to resist GPS spoofing via interactive attack-defense game modeling. Second, a behavior-driven authentication method with trust evaluation is developed to mitigate insider threats. Third, a multi-agent attack forensics framework is devised to intelligently trace the propagation paths of multi-hop attacks in UAV networks. Experimental results validate the effectiveness of the proposed approaches. Finally, several open research directions are outlined.
SD · Mar 18, 2023
EarCough: Enabling Continuous Subject Cough Event Detection on Hearables
Xiyuxing Zhang, Yuntao Wang, Jingru Zhang et al.
Cough monitoring can enable new individual pulmonary health applications. Subject cough event detection is the foundation for continuous cough monitoring. Recently, the rapid growth in smart hearables has opened new opportunities for such needs. This paper proposes EarCough, which enables continuous subject cough event detection on edge computing hearables by leveraging the always-on active noise cancellation (ANC) microphones. Specifically, we proposed a lightweight end-to-end neural network model -- EarCoughNet. To evaluate the effectiveness of our method, we constructed a synchronous motion and audio dataset through a user study. Results show that EarCough achieved an accuracy of 95.4% and an F1-score of 92.9% with a space requirement of only 385 kB. We envision EarCough as a low-cost add-on for future hearables to enable continuous subject cough event detection.
HC · Mar 18, 2023
GazeReader: Detecting Unknown Word Using Webcam for English as a Second Language (ESL) Learners
Jiexin Ding, Bowen Zhao, Yuqi Huang et al.
Automatic unknown word detection techniques can enable new applications for assisting English as a Second Language (ESL) learners, thus improving their reading experiences. However, most modern unknown word detection methods require dedicated eye-tracking devices with high precision that are not easily accessible to end-users. In this work, we propose GazeReader, an unknown word detection method only using a webcam. GazeReader tracks the learner's gaze and then applies a transformer-based machine learning model that encodes the text information to locate the unknown word. We applied knowledge enhancement including term frequency, part of speech, and named entity recognition to improve the performance. The user study indicates that the accuracy and F1-score of our method were 98.09% and 75.73%, respectively. Lastly, we explored the design scope for ESL reading and discussed the findings.
CV · Oct 14, 2022
MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human Activity Recognition
Ziqi Gao, Yuntao Wang, Jianguo Chen et al.
Multimodal sensors provide complementary information to develop accurate machine-learning methods for human activity recognition (HAR), but introduce significantly higher computational load, which reduces efficiency. This paper proposes an efficient multimodal neural architecture for HAR using an RGB camera and inertial measurement units (IMUs) called Multimodal Temporal Segment Attention Network (MMTSA). MMTSA first transforms IMU sensor data into a temporal and structure-preserving gray-scale image using the Gramian Angular Field (GAF), representing the inherent properties of human activities. MMTSA then applies a multimodal sparse sampling method to reduce data redundancy. Lastly, MMTSA adopts an inter-segment attention module for efficient multimodal fusion. Using three well-established public datasets, we evaluated MMTSA's effectiveness and efficiency in HAR. Results show that our method improves the cross-subject F1-score on the MMAct dataset by 11.13% over the previous state-of-the-art (SOTA) methods. The ablation study and analysis demonstrate MMTSA's effectiveness in fusing multimodal data for accurate HAR. The efficiency evaluation on an edge device showed that MMTSA achieved significantly better accuracy, lower computational load, and lower inference latency than SOTA methods.
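The Gramian Angular Field step mentioned above can be sketched in a few lines. This is a minimal illustration of the standard summation variant (GASF), assuming a single non-constant sensor axis; whether MMTSA uses the summation or difference variant, and how it maps the result to gray-scale pixels, is not stated in the abstract. The function name is hypothetical.

```python
import math


def gramian_angular_summation_field(series):
    """Encode a 1-D series as a GASF matrix with entries cos(phi_i + phi_j).

    Assumes the series is not constant (otherwise the rescaling divides
    by zero).
    """
    lo, hi = min(series), max(series)
    # Rescale to [-1, 1] so every value has a valid arccos.
    scaled = [2.0 * (v - lo) / (hi - lo) - 1.0 for v in series]
    # Guard against tiny floating-point overshoot outside [-1, 1].
    scaled = [max(-1.0, min(1.0, v)) for v in scaled]
    # Polar encoding: each value becomes an angle.
    phi = [math.acos(v) for v in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]


imu_axis = [0.0, 0.5, 1.0, 0.5, 0.0]  # toy accelerometer readings
gaf = gramian_angular_summation_field(imu_axis)
```

The result is a symmetric square matrix with one row and column per time step, so temporal order is kept along the diagonal, which is what makes it usable as an image input.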
NI · Nov 24, 2025
Agent Discovery in Internet of Agents: Challenges and Solutions
Shaolong Guo, Yuntao Wang, Zhou Su et al.
Rapid advances in large language models and agentic AI are driving the emergence of the Internet of Agents (IoA), a paradigm where billions of autonomous software and embodied agents interact, coordinate, and collaborate to accomplish complex tasks. A key prerequisite for such large-scale collaboration is agent capability discovery, where agents identify, advertise, and match one another's capabilities under dynamic tasks. Agent capabilities in IoA are inherently heterogeneous and context-dependent, raising challenges in capability representation, scalable discovery, and long-term performance. To address these issues, this paper introduces a novel two-stage capability discovery framework. The first stage, autonomous capability announcement, allows agents to credibly publish machine-interpretable descriptions of their abilities. The second stage, task-driven capability discovery, enables context-aware search, ranking, and composition to locate and assemble suitable agents for specific tasks. Building on this framework, we propose a novel scheme that integrates semantic capability modeling, scalable and updatable indexing, and memory-enhanced continual discovery. Simulation results demonstrate that our approach enhances discovery performance and scalability. Finally, we outline a research roadmap and highlight open problems and promising directions for future IoA.
CV · Mar 6 · Code
Adaptive Language-Aware Image Reflection Removal Network
Siyan Fang, Yuntao Wang, Jinpu Zhang et al.
Existing image reflection removal methods struggle to handle complex reflections. Accurate language descriptions can help the model understand the image content to remove complex reflections. However, due to blurred and distorted interferences in reflected images, machine-generated language descriptions of the image content are often inaccurate, which harms the performance of language-guided reflection removal. To address this, we propose the Adaptive Language-Aware Network (ALANet) to remove reflections even with inaccurate language inputs. Specifically, ALANet integrates both filtering and optimization strategies. The filtering strategy reduces the negative effects of language while preserving its benefits, whereas the optimization strategy enhances the alignment between language and visual features. ALANet also utilizes language cues to decouple specific layer content from feature maps, improving its ability to handle complex reflections. To evaluate the model's performance under complex reflections and varying levels of language accuracy, we introduce the Complex Reflection and Language Accuracy Variance (CRLAV) dataset. Experimental results demonstrate that ALANet surpasses state-of-the-art methods for image reflection removal. The code and dataset are available at https://github.com/fashyon/ALANet.
34.2 · HC · May 10
AuthGlass: Benchmarking Voice Liveness Detection and Authentication on Smart Glasses via Comprehensive Acoustic Features
Weiye Xu, Zhang Jiang, Siqi Zheng et al.
With the rapid advancement of smart glasses, voice interaction has been widely adopted due to its naturalness and convenience. However, its practical deployment is often undermined by vulnerability to spoofing attacks, while no public dataset currently exists for voice liveness detection and authentication in smart-glasses scenarios. To address this challenge, we first collect a multi-acoustic-modal dataset comprising 16-channel audio data from 42 subjects, along with corresponding attack samples covering two attack categories. Based on insights derived from this collected data, we propose AuthG-Live, a sound-field-based voice liveness detection method, and AuthG-Net, a multi-acoustic-modal authentication model. We further benchmark seven voice liveness detection methods and four authentication methods across diverse acoustic modalities. The results demonstrate that our proposed approach achieves state-of-the-art performance on four benchmark tasks, and extensive ablation studies validate the generalizability of our methods under real-world constraints. Finally, we release this dataset, termed AuthGlass, to facilitate future research on voice liveness detection and authentication for smart glasses.
46.9 · HC · Mar 27
Routine Computing: A Systematic Review of Sensing Daily Life Dimensions Towards Human-Centered Goals
Borislav Pavlov, Jiajin Li, Jun Fang et al.
Human routines structure daily life, yet remain challenging for computational systems to understand. This paper presents the first systematic review of routine computing, a previously implicit but increasingly recognized field that focuses on computationally sensing and modeling human behaviors. It synthesizes 203 studies published up to August 2025. The paper presents a new taxonomy of the literature, focusing on temporal structures, behavioral interactions, cognitive aspects, and how variability and deviations are addressed. The common goals of routine computing extend across four major application domains, including accessibility care, the promotion of healthy habits, adaptive and context-aware support, and large-scale population insights. Persistent challenges that limit the design of truly human-centered systems are identified, including the gap between low-level activity recognition and high-level intent, the tension between personalization and generalization, unresolved privacy concerns, and data-related limitations. By consolidating these findings, this paper provides a foundational framework for HCI researchers, outlining principles for designing ethical, adaptive, and human-centered routine-aware systems.
LG · Oct 31, 2025
ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models
Xin Tang, Youfang Han, Fangfei Gou et al.
Vision-Language Models (VLMs) excel in diverse multimodal tasks. However, user requirements vary across scenarios, which can be categorized into fast response, high-quality output, and low energy consumption. Relying solely on large models deployed in the cloud for all queries often leads to high latency and energy cost, while small models deployed on edge devices are capable of handling simpler tasks with low latency and energy cost. To fully leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach successfully routes over 80% of queries to the small model while incurring less than a 10% drop in problem-solving probability.
AI · Jul 25, 2025 · Code
PhysDrive: A Multimodal Remote Physiological Measurement Dataset for In-vehicle Driver Monitoring
Jiyao Wang, Xiao Yang, Qingyong Hu et al. · tsinghua
Robust and unobtrusive in-vehicle physiological monitoring is crucial for ensuring driving safety and user experience. While remote physiological measurement (RPM) offers a promising non-invasive solution, its translation to real-world driving scenarios is critically constrained by the scarcity of comprehensive datasets. Existing resources are often limited in scale, modality diversity, the breadth of biometric annotations, and the range of captured conditions, thereby omitting inherent real-world challenges in driving. Here, we present PhysDrive, the first large-scale multimodal dataset for contactless in-vehicle physiological sensing with dedicated consideration of various modality settings and driving factors. PhysDrive collects data from 48 drivers, including synchronized RGB, near-infrared camera, and raw mmWave radar data, accompanied by six synchronized ground truths (ECG, BVP, Respiration, HR, RR, and SpO2). It covers a wide spectrum of naturalistic driving conditions, including driver motions, dynamic natural light, vehicle types, and road conditions. We extensively evaluate both signal-processing and deep-learning methods on PhysDrive, establishing a comprehensive benchmark across all modalities, and release full open-source code with compatibility for mainstream public toolboxes. We envision PhysDrive will serve as a foundational resource and accelerate research on multimodal driver monitoring and smart-cockpit systems.
91.0 · CV · May 17
EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning
Zeyu Wang, Chang Liu, Eduardus Tjitrahardja et al.
Despite extensive efforts on egocentric video datasets and benchmarks, understanding users' internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users' interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users' internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users' subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/
42.9 · HC · Apr 8
LubDubDecoder: Bringing Micro-Mechanical Cardiac Monitoring to Hearables
Siqi Zhang, Xiyuxing Zhang, Duc Vu et al.
We present LubDubDecoder, a system that enables fine-grained monitoring of micro-cardiac vibrations associated with the opening and closing of heart valves across a range of hearables. Our system transforms the built-in speaker, the only transducer common to all hearables, into an acoustic sensor that captures the coarse "lub-dub" heart sounds, leverages their shared temporal and spectral structure to reconstruct the subtle seismocardiography (SCG) and gyrocardiography (GCG) waveforms, and extracts the timing of key micro-cardiac events. In an IRB-approved feasibility study with 25 users, our system achieves correlations of 0.88-0.95 compared to chest-mounted reference measurements in within-user and cross-user evaluations, and generalizes to unseen hearables using a zero-effort adaptation scheme with a correlation of 0.91. Our system is robust across remounting sessions and music playback.
48.4 · HC · Apr 7
SpeakSoftly: Scaffolding Nonviolent Communication in Intimate Relationships through LLM-Powered Just-In-Time Interventions
Ka I Chan, Hongbo Lan, Jun Fang et al.
Conflicts are common in text-based communication, particularly in intimate relationships, where misunderstandings can easily escalate into verbal aggression. To address this, we present SpeakSoftly, a system that applies Nonviolent Communication (NVC) principles to scaffold couples' conflict communication through LLM-powered just-in-time interventions. Informed by formative interviews with couples and NVC principles, we designed two core features: NVC-Prompt, which detects verbal aggression and suggests revisions to prevent escalation, and NVC-Guide, which analyzes dialogues to uncover users' feelings and needs, fostering self-awareness and perspective-taking. These features were implemented across three progressive intervention modes, each varying in intervention depth and tone: Basic Reminder, Neutral Guide, and Empathetic Guide. We conducted a mixed-methods user study with 18 couples across simulated and real-life conflict settings to evaluate the effectiveness of each mode. Results showed that Empathetic Guide significantly facilitated both behavioral and cognitive changes, while Neutral Guide was effective only for behavioral changes in simulated conflicts. In real-life conflicts, Neutral Guide showed distinct advantages due to lower cognitive load demands. We discuss the mechanisms behind these findings and propose design implications for in-situ interventions in high-stakes communication contexts.
CVJun 11, 2025Code
Non-Contact Health Monitoring During Daily Personal Care RoutinesXulin Ma, Jiankai Tang, Zhang Jiang et al. · tsinghua
Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring of physiological signals and offers a practical alternative to traditional health sensing methods. Although rPPG is promising for daily health monitoring, its application in long-term personal care scenarios, such as mirror-facing routines in high-altitude environments, remains challenging due to ambient lighting variations, frequent occlusions from hand movements, and dynamic facial postures. To address these challenges, we present LADH (Long-term Altitude Daily Health), the first long-term rPPG dataset containing 240 synchronized RGB and infrared (IR) facial videos from 21 participants across five common personal care scenarios, along with ground-truth PPG, respiration, and blood oxygen signals. Our experiments demonstrate that combining RGB and IR video inputs improves the accuracy and robustness of non-contact physiological monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate estimation. Furthermore, we find that multi-task learning enhances performance across multiple physiological indicators simultaneously. Dataset and code are open at https://github.com/McJackTang/FusionVitals.
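The heart-rate MAE reported above is a per-sample error in beats per minute. As a minimal sketch of how such a figure is commonly produced (an illustration, not the LADH evaluation code), the snippet below estimates heart rate from a pulse waveform by picking the strongest spectral peak in a plausible heart-rate band, then averages absolute errors. The function names, the 0.7 to 3.0 Hz band, and the synthetic signal are assumptions made for the example.

```python
import cmath
import math


def estimate_heart_rate_bpm(signal, fs, low_hz=0.7, high_hz=3.0):
    """Return the dominant frequency of `signal` inside [low_hz, high_hz], in BPM.

    Uses a direct DFT over the bins that fall inside the band (assumed band,
    roughly 42 to 180 BPM). `fs` is the sampling rate in Hz.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC component
    best_freq, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if freq < low_hz or freq > high_hz:
            continue
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = freq, power
    if best_freq is None:
        raise ValueError("signal too short to resolve the requested band")
    return best_freq * 60.0


def mean_absolute_error(predicted_bpm, reference_bpm):
    """Mean absolute error between two equally long lists of heart rates."""
    if not predicted_bpm or len(predicted_bpm) != len(reference_bpm):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted_bpm, reference_bpm))
    return total / len(predicted_bpm)


# Synthetic 10 s pulse at 1.2 Hz (72 BPM), sampled at 30 frames per second.
synthetic = [math.sin(2 * math.pi * 1.2 * i / 30) for i in range(300)]
estimated = estimate_heart_rate_bpm(synthetic, fs=30)
```

The frequency resolution of this estimator is `fs / n` Hz, so a 10 s window resolves heart rate only to within 6 BPM; longer windows or interpolation are typically used when finer resolution matters.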
26.1CRMay 6
Vol-Mark: A Watermark for 3D Medical Volume Data Via Cubic Difference Expansion and Contrastive LearningJiangnan Zhu, Yuntao Wang, Shengli Pan et al.
Today, advances in medical technology extensively utilize 3D volume data for accurate and efficient diagnostics. However, sharing these data across networks in telemedicine poses significant security risks of data tampering and unauthorized copying. To address these challenges, this paper proposes a novel reversible-zero watermarking approach, termed Vol-Mark, for medical volume data to protect their ownership and authenticity in telemedicine. The proposed Vol-Mark method offers two key benefits: 1) it designs a volume data feature extractor that leverages contrastive learning to efficiently extract discriminative and stable volumetric features, ensuring robustness against 3D attacks; 2) it introduces the cubic difference expansion (c-DE) technique, which leverages the 3D integer wavelet transform to embed watermark bits into neighboring voxels within cubes at low-frequency coefficients. The voxel differences within each cube are expanded to create embedding space, and a majority voting mechanism is employed during extraction to enhance reliability. The embedding process incurs low distortion and supports lossless removal, thereby preserving the integrity and diagnostic accuracy of medical volume data. Through these two benefits, Vol-Mark enables both integrity verification and ownership verification. Integrity verification is first performed, and ownership verification through hypothesis testing is further conducted to enhance reliability, particularly under data tampering or watermark removal attacks. Comprehensive experimental results show the effectiveness of the proposed method and its superior robustness against conventional, geometric, and hybrid attacks on medical volume data. In particular, through multiple tasks evaluations, Vol-Mark consistently achieves an ACC above 0.90 in most attack scenarios, outperforming existing methods by a clear margin.
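To make the idea of reversible embedding by difference expansion concrete, here is a minimal sketch of the classic pairwise integer scheme: one bit is hidden in the expanded difference of two integers, and both the bit and the original pair are recovered exactly. This is not Vol-Mark's cubic variant, which the abstract says operates on cubes of 3D integer wavelet coefficients; the function names are hypothetical and overflow handling is omitted.

```python
def de_embed(x, y, bit):
    """Embed one bit into the integer pair (x, y) by expanding their difference."""
    average = (x + y) // 2          # integer average, preserved by the transform
    difference = x - y
    expanded = 2 * difference + bit  # shift the difference left and append the bit
    x_marked = average + (expanded + 1) // 2
    y_marked = average - expanded // 2
    return x_marked, y_marked


def de_extract(x_marked, y_marked):
    """Recover the embedded bit and the original pair from a marked pair."""
    average = (x_marked + y_marked) // 2
    expanded = x_marked - y_marked
    bit = expanded & 1               # the appended bit is the least significant bit
    difference = expanded // 2       # floor division undoes the expansion
    x = average + (difference + 1) // 2
    y = average - difference // 2
    return bit, x, y
```

Because the integer average is unchanged and the original difference is the expanded difference shifted right by one, removal is lossless. A practical scheme must also check that marked values stay within the valid range before embedding.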
CVMar 31, 2025Code
Exploring Reliable PPG Authentication on Smartwatches in Daily ScenariosJiankai Tang, Jiacheng Liu, Renling Tong et al. · tsinghua
Photoplethysmography (PPG) sensors, widely deployed in smartwatches, offer a simple and non-invasive authentication approach for daily use. However, PPG authentication faces reliability issues due to motion artifacts from physical activity and physiological variability over time. To address these challenges, we propose MTL-RAPID, an efficient and reliable PPG authentication model that employs a multitask joint training strategy, simultaneously assessing signal quality and verifying user identity. The joint optimization of these two tasks in MTL-RAPID results in a structure that outperforms models trained on individual tasks separately, achieving stronger performance with fewer parameters. In our comprehensive user studies regarding motion artifacts (N = 30), time variations (N = 32), and user preferences (N = 16), MTL-RAPID achieves a best AUC of 99.2% and an EER of 3.5%, outperforming existing baselines. We open-source our PPG authentication dataset and the MTL-RAPID model on GitHub to facilitate future research.
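The equal error rate (EER) quoted above is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below shows one simple way to approximate it from similarity scores by sweeping thresholds. It is an illustration of the metric only, not the MTL-RAPID evaluation code, and the function name and score lists are made up for the example.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    A sample is accepted when its score is >= the threshold. The returned
    value is the average of FAR and FRR at the threshold where they are closest.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, best_rate = None, None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False acceptance: impostor attempts that pass the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False rejection: genuine attempts that fail the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate
```

With few samples the two error curves may never cross exactly, which is why the sketch reports the midpoint at the closest threshold; evaluation toolkits often interpolate along the ROC curve instead.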
CVJan 1
Depth-Synergized Mamba Meets Memory Experts for All-Day Image Reflection SeparationSiyan Fang, Long Peng, Yuntao Wang et al.
Image reflection separation aims to disentangle the transmission layer and the reflection layer from a blended image. Existing methods rely on limited information from a single image, tending to confuse the two layers when their contrasts are similar, a challenge more severe at night. To address this issue, we propose the Depth-Memory Decoupling Network (DMDNet). It employs the Depth-Aware Scanning (DAScan) to guide Mamba toward salient structures, promoting information flow along semantic coherence to construct stable states. Working in synergy with DAScan, the Depth-Synergized State-Space Model (DS-SSM) modulates the sensitivity of state activations by depth, suppressing the spread of ambiguous features that interfere with layer disentanglement. Furthermore, we introduce the Memory Expert Compensation Module (MECM), leveraging cross-image historical knowledge to guide experts in providing layer-specific compensation. To address the lack of datasets for nighttime reflection separation, we construct the Nighttime Image Reflection Separation (NightIRS) dataset. Extensive experiments demonstrate that DMDNet outperforms state-of-the-art methods in both daytime and nighttime.
IVDec 10, 2023Code
A Comprehensive Dataset and Automated Pipeline for Nailfold Capillary AnalysisLinxi Zhao, Jiankai Tang, Dongyu Chen et al.
Nailfold capillaroscopy is widely used in assessing health conditions, highlighting the pressing need for an automated nailfold capillary analysis system. In this study, we present a pioneering effort in constructing a comprehensive nailfold capillary dataset (321 images and 219 videos from 68 subjects, with clinic reports and expert annotations) that serves as a crucial resource for training deep-learning models. Leveraging this dataset, we finetuned three deep learning models with expert annotations as supervised labels and integrated them into a novel end-to-end nailfold capillary analysis pipeline. This pipeline excels in automatically detecting and measuring a wide range of size factors, morphological features, and dynamic aspects of nailfold capillaries. We compared our outcomes with clinical reports. Experiment results showed that our automated pipeline achieves an average of sub-pixel level precision in measurements and 89.9% accuracy in identifying morphological abnormalities. These results underscore its potential for advancing quantitative medical research and enabling pervasive computing in healthcare. Our data and code are available at https://github.com/THU-CS-PI-LAB/ANFC-Automated-Nailfold-Capillary.
HCMar 3, 2024
Time2Stop: Adaptive and Explainable Human-AI Loop for Smartphone Overuse InterventionAdiba Orzikulova, Han Xiao, Zhipeng Li et al.
Despite a rich history of investigating smartphone overuse intervention techniques, AI-based just-in-time adaptive intervention (JITAI) methods for overuse reduction are lacking. We develop Time2Stop, an intelligent, adaptive, and explainable JITAI system that leverages machine learning to identify optimal intervention timings, introduces interventions with transparent AI explanations, and collects user feedback to establish a human-AI loop and adapt the intervention model over time. We conducted an 8-week field experiment (N=71) to evaluate the effectiveness of both the adaptation and explanation aspects of Time2Stop. Our results indicate that our adaptive models significantly outperform the baseline methods on intervention accuracy (>32.8% relatively) and receptivity (>8.0%). In addition, incorporating explanations further enhances the effectiveness by 53.8% and 11.4% on accuracy and receptivity, respectively. Moreover, Time2Stop significantly reduces overuse, decreasing app visit frequency by 7.0% to 8.9%. Our subjective data also echoed these quantitative measures. Participants preferred the adaptive interventions and rated the system highly on intervention time accuracy, effectiveness, and level of trust. We envision our work can inspire future research on JITAI systems with a human-AI loop to evolve with users.
CVFeb 7, 2024
Spiking-PhysFormer: Camera-Based Remote Photoplethysmography with Parallel Spike-driven TransformerMingxuan Liu, Jiankai Tang, Yongli Chen et al. · tsinghua
Artificial neural networks (ANNs) can help camera-based remote photoplethysmography (rPPG) measure cardiac activity and physiological signals, such as pulse wave, heart rate, and respiration rate, from facial videos with better accuracy. However, most existing ANN-based methods require substantial computing resources, which poses challenges for effective deployment on mobile devices. Spiking neural networks (SNNs), on the other hand, hold immense potential for energy-efficient deep learning owing to their binary and event-driven architecture. To the best of our knowledge, we are the first to introduce SNNs into the realm of rPPG, proposing a hybrid neural network (HNN) model, the Spiking-PhysFormer, aimed at reducing power consumption. Specifically, the proposed Spiking-PhysFormer consists of an ANN-based patch embedding block, SNN-based transformer blocks, and an ANN-based predictor head. First, to simplify the transformer block while preserving its capacity to aggregate local and global spatio-temporal features, we design a parallel spike transformer block to replace sequential sub-blocks. Additionally, we propose a simplified spiking self-attention mechanism that omits the value parameter without compromising the model's performance. Experiments conducted on four datasets (PURE, UBFC-rPPG, UBFC-Phys, and MMPD) demonstrate that the proposed model achieves a 12.4% reduction in power consumption compared to PhysFormer. Additionally, the power consumption of the transformer block is reduced by a factor of 12.2, while maintaining performance comparable to PhysFormer and other ANN-based models.
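The "binary and event-driven" behaviour of spiking networks can be illustrated with a generic leaky integrate-and-fire (LIF) neuron: the membrane potential leaks, accumulates input, and emits a 0/1 spike only when it crosses a threshold. This is a textbook-style sketch, not the neuron model used in Spiking-PhysFormer; the decay factor, threshold, and hard reset to zero are assumptions chosen for the example.

```python
def lif_neuron(inputs, decay=0.5, threshold=1.0):
    """Simulate a single leaky integrate-and-fire neuron over a list of inputs.

    At each step the membrane potential is multiplied by `decay` (the leak),
    the input current is added, and a spike (1) is emitted if the potential
    reaches `threshold`, after which the potential is reset to zero.
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        potential = decay * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


spike_train = lif_neuron([0.6, 0.6, 0.6, 0.6])
```

Because downstream layers only receive 0/1 events, multiplications by activations can be replaced with conditional accumulations, which is the usual argument for the energy efficiency of spiking models.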
HCMay 13, 2024
G-VOILA: Gaze-Facilitated Information Querying in Daily ScenariosZeyu Wang, Yuanchun Shi, Yuntao Wang et al.
Modern information querying systems are progressively incorporating multimodal inputs like vision and audio. However, the integration of gaze -- a modality deeply linked to user intent and increasingly accessible via gaze-tracking wearables -- remains underexplored. This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA, which synergizes users' gaze, visual field, and voice-based natural language queries to facilitate a more intuitive querying process. In a user-enactment study involving 21 participants in 3 daily scenarios (p = 21, scene = 3), we revealed the ambiguity in users' query language and a gaze-voice coordination pattern in users' natural query behaviors with G-VOILA. Based on the quantitative and qualitative findings, we developed a design framework for the G-VOILA paradigm, which effectively integrates the gaze data with the in-situ querying context. Then we implemented a G-VOILA proof-of-concept using cutting-edge deep learning techniques. A follow-up user study (p = 16, scene = 2) demonstrates its effectiveness by achieving both higher objective score and subjective score, compared to a baseline without gaze data. We further conducted interviews and provided insights for future gaze-facilitated information querying systems.
CVDec 22, 2023
Voila-A: Aligning Vision-Language Models with User's Gaze AttentionKun Yan, Lei Ji, Zeyu Wang et al.
In recent years, the integration of vision and language understanding has led to significant advancements in artificial intelligence, particularly through Vision-Language Models (VLMs). However, existing VLMs face challenges in handling real-world applications with complex scenes and multiple objects, as well as aligning their focus with the diverse attention patterns of human users. In this paper, we introduce gaze information, feasibly collected by AR or VR devices, as a proxy for human attention to guide VLMs and propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications. First, we collect hundreds of minutes of gaze data to demonstrate that we can mimic human gaze modalities using localized narratives. We then design an automatic data annotation pipeline utilizing GPT-4 to generate the VOILA-COCO dataset. Additionally, we innovate the Voila Perceiver modules to integrate gaze information into VLMs while preserving their pretrained knowledge. We evaluate Voila-A using a hold-out validation set and a newly collected VOILA-GAZE Testset, which features real-life scenarios captured with a gaze-tracking device. Our experimental results demonstrate that Voila-A significantly outperforms several baseline models. By aligning model attention with human gaze patterns, Voila-A paves the way for more intuitive, user-centric VLMs and fosters engaging human-AI interaction across a wide range of applications.
MAMay 12, 2025
Internet of Agents: Fundamentals, Applications, and ChallengesYuntao Wang, Shaolong Guo, Yanghe Pan et al.
With the rapid proliferation of large language models and vision-language models, AI agents have evolved from isolated, task-specific systems into autonomous, interactive entities capable of perceiving, reasoning, and acting without human intervention. As these agents proliferate across virtual and physical environments, from virtual assistants to embodied robots, the need for a unified, agent-centric infrastructure becomes paramount. In this survey, we introduce the Internet of Agents (IoA) as a foundational framework that enables seamless interconnection, dynamic discovery, and collaborative orchestration among heterogeneous agents at scale. We begin by presenting a general IoA architecture, highlighting its hierarchical organization, distinguishing features relative to the traditional Internet, and emerging applications. Next, we analyze the key operational enablers of IoA, including capability notification and discovery, adaptive communication protocols, dynamic task matching, consensus and conflict-resolution mechanisms, and incentive models. Finally, we identify open research directions toward building resilient and trustworthy IoA ecosystems.
CVApr 7, 2024
Camera-Based Remote Physiology Sensing for Hundreds of Subjects Across Skin TonesJiankai Tang, Xinyi Li, Jiacheng Liu et al. · tsinghua
Remote photoplethysmography (rPPG) emerges as a promising method for non-invasive, convenient measurement of vital signs, utilizing the widespread presence of cameras. Despite advancements, existing datasets fall short in terms of size and diversity, limiting comprehensive evaluation under diverse conditions. This paper presents an in-depth analysis of the VitalVideo dataset, the largest real-world rPPG dataset to date, encompassing 893 subjects and 6 Fitzpatrick skin tones. Our experimentation with six unsupervised methods and three supervised models demonstrates that datasets comprising a few hundred subjects (i.e., 300 for UBFC-rPPG, 500 for PURE, and 700 for MMPD-Simple) are sufficient for effective rPPG model training. Our findings highlight the importance of diversity and consistency in skin tones for precise performance evaluation across different datasets.
CRMay 12, 2025
Security of Internet of Agents: Attacks and CountermeasuresYuntao Wang, Yanghe Pan, Shaolong Guo et al.
With the rise of large language and vision-language models, AI agents have evolved into autonomous, interactive systems capable of perception, reasoning, and decision-making. As they proliferate across virtual and physical domains, the Internet of Agents (IoA) has emerged as a key infrastructure for enabling scalable and secure coordination among heterogeneous agents. This survey offers a comprehensive examination of the security and privacy landscape in IoA systems. We begin by outlining the IoA architecture and its distinct vulnerabilities compared to traditional networks, focusing on four critical aspects: identity authentication threats, cross-agent trust issues, embodied security, and privacy risks. We then review existing and emerging defense mechanisms and highlight persistent challenges. Finally, we identify open research directions to advance the development of resilient and privacy-preserving IoA ecosystems.
IVNov 22, 2024
A Plug-and-Play Temporal Normalization Module for Robust Remote PhotoplethysmographyKegang Wang, Jiankai Tang, Yantao Wei et al. · tsinghua
Remote photoplethysmography (rPPG) extracts PPG signals from subtle color changes in facial videos, showing strong potential for health applications. However, most rPPG methods rely on intensity differences between consecutive frames, missing long-term signal variations affected by motion or lighting artifacts, which reduces accuracy. This paper introduces Temporal Normalization (TN), a flexible plug-and-play module compatible with any end-to-end rPPG network architecture. By capturing long-term temporally normalized features following detrending, TN effectively mitigates motion and lighting artifacts, significantly boosting the rPPG prediction performance. When integrated into four state-of-the-art rPPG methods, TN delivered performance improvements ranging from 34.3% to 94.2% in heart rate measurement tasks across four widely-used datasets. Notably, TN showed even greater performance gains in smaller models. We further discuss and provide insights into the mechanisms behind TN's effectiveness.
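The "intensity differences between consecutive frames" mentioned above are often computed as a normalized frame difference, a form popularized by DeepPhys-style models. The sketch below applies it to a list of per-frame mean intensities to show why such an input only reflects short-range change. It illustrates the baseline representation, not the TN module itself, whose exact formulation is not given in the abstract; the function name and the `eps` guard are assumptions.

```python
def normalized_frame_differences(frame_means, eps=1e-7):
    """Return (c[t+1] - c[t]) / (c[t+1] + c[t]) for each consecutive frame pair.

    `frame_means` holds one average intensity per frame. Dividing by the sum
    makes the result insensitive to a global brightness scale; `eps` avoids
    division by zero for completely dark frame pairs.
    """
    differences = []
    for current, following in zip(frame_means, frame_means[1:]):
        differences.append((following - current) / (following + current + eps))
    return differences


diffs = normalized_frame_differences([100.0, 110.0, 99.0])
```

Each output value depends on only two adjacent frames, so slow drifts caused by motion or lighting are invisible to it, which is the limitation the abstract attributes to frame-difference inputs.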
HCMay 22, 2024
AUGlasses: Continuous Action Unit based Facial Reconstruction with Low-power IMUs on Smart GlassesYanrong Li, Tengxiang Zhang, Xin Zeng et al.
Recent advancements in augmented reality (AR) have enabled the use of various sensors on smart glasses for applications like facial reconstruction, which is vital to improve AR experiences for virtual social activities. However, the size and power constraints of smart glasses demand a miniature and low-power sensing solution. AUGlasses achieves unobtrusive low-power facial reconstruction by placing inertial measurement units (IMU) against the temporal area on the face to capture the skin deformations, which are caused by facial muscle movements. These IMU signals, along with historical data on facial action units (AUs), are processed by a transformer-based deep learning model to estimate AU intensities in real-time, which are then used for facial reconstruction. Our results show that AUGlasses accurately predicts the strength (0-5 scale) of 14 key AUs with a cross-user mean absolute error (MAE) of 0.187 (STD = 0.025) and achieves facial reconstruction with a cross-user MAE of 1.93 mm (STD = 0.353). We also integrated various preprocessing and training techniques to ensure robust performance for continuous sensing. Micro-benchmark tests indicate that our system consistently performs accurate continuous facial reconstruction with a fine-tuned cross-user model, achieving an AU MAE of 0.35.
92.1HCApr 9
StoryEcho: A Generative Child-as-Actor Storytelling System for Picky-Eating InterventionYanuo Zhou, Jun Fang, Yuntao Wang et al.
Picky eating in children can undermine dietary diversity and the development of healthy eating habits, while also creating recurring tension in family feeding routines. Prior interventions have explored food-centered designs, enhanced utensils, and mealtime interactive systems, but few position children as active participants in intervention processes that extend beyond single mealtime interactions. To better understand everyday responses to picky eating and child-acceptable intervention mechanisms, we conducted a formative study with caregivers and kindergarten teachers. Based on the resulting design considerations and iterative stakeholder review, we designed StoryEcho, a generative child-as-actor storytelling system for picky eating intervention. StoryEcho engages children outside mealtimes through personalized stories in which the child appears as a persistent story character and later shapes story development through real-world food-related behavior. The system combines non-mealtime story engagement, lightweight post-meal feedback, and behavior-informed story updates to support repeated intervention across everyday family routines. We evaluated StoryEcho in a between-group field study with 11 families of preschool children. Results provide preliminary evidence that StoryEcho can significantly increase children's willingness to approach and try target low-preference foods while reducing parental pressure around feeding. These findings suggest the promise of generative child-as-actor storytelling as a design approach for home-based behavior support that unfolds through recurring family routines.
CVApr 2, 2025
Memory-efficient Low-latency Remote Photoplethysmography through Temporal-Spatial State Space DualityKegang Wang, Jiankai Tang, Yuxuan Fan et al. · tsinghua
Remote photoplethysmography (rPPG), enabling non-contact physiological monitoring through facial light reflection analysis, faces critical computational bottlenecks as deep learning introduces performance gains at the cost of prohibitive resource demands. This paper proposes ME-rPPG, a memory-efficient algorithm built on temporal-spatial state space duality, which resolves the trilemma of model scalability, cross-dataset generalization, and real-time constraints. Leveraging a transferable state space, ME-rPPG efficiently captures subtle periodic variations across facial frames while maintaining minimal computational overhead, enabling training on extended video sequences and supporting low-latency inference. Achieving cross-dataset MAEs of 5.38 (MMPD), 0.70 (VitalVideo), and 0.25 (PURE), ME-rPPG outperforms all baselines with improvements ranging from 21.3% to 60.2%. Our solution enables real-time inference with only 3.6 MB memory usage and 9.46 ms latency -- surpassing existing methods by 19.5%-49.7% accuracy and 43.2% user satisfaction gains in real-world deployments. The code and demos are released for reproducibility on https://health-hci-group.github.io/ME-rPPG-demo/.
HCFeb 14, 2025
Unknown Word Detection for English as a Second Language (ESL) Learners Using Gaze and Pre-trained Language ModelsJiexin Ding, Bowen Zhao, Yuntao Wang et al.
English as a Second Language (ESL) learners often encounter unknown words that hinder their text comprehension. Automatically detecting these words as users read can enable computing systems to provide just-in-time definitions, synonyms, or contextual explanations, thereby helping users learn vocabulary in a natural and seamless manner. This paper presents EyeLingo, a transformer-based machine learning method that predicts the probability of unknown words based on text content and eye gaze trajectory in real time with high accuracy. A 20-participant user study revealed that our method can achieve an accuracy of 97.6%, and an F1-score of 71.1%. We implemented a real-time reading assistance prototype to show the effectiveness of EyeLingo. The user study shows improvement in willingness to use and usefulness compared to baseline methods.
92.8HCMar 31
Exploring and Analyzing the Effect of Avatar's Visual Style on Anxiety of English as Second Language (ESL) SpeakersTianqi Liu, Xin Yi, Yuanchun Shi et al.
Virtual avatars offer new opportunities to reshape communication experiences beyond traditional live video. However, it remains unclear how avatar representations influence communication anxiety for English as a Second Language (ESL) speakers, and why such effects emerge. To take a first step to address this, we conducted a controlled laboratory study in which Mandarin-speaking ESL participants engaged in one-on-one conversations under three representation conditions: live video, stylized avatars, and realistic avatars. We assessed anxiety using both self-reported measures and physiological signals (EDA, ECG, PPG). Our results show that avatar style plays a critical role in shaping communication anxiety. While live video remained a strong baseline with low subjective anxiety, stylized avatars achieved comparable, and in some cases lower, physiological anxiety levels, whereas realistic avatars elicited higher anxiety. Beyond these effects, our findings reveal three underlying mechanisms that explain how avatar representations shape ESL communication anxiety: (1) facial expressiveness; (2) perceived feedback and fear of negative evaluation; and (3) contextual appropriateness. This work provides actionable design implications for developing avatar-mediated communication systems that support emotionally sustainable cross-linguistic interaction.
CVSep 26, 2025
Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction ParadigmZeyu Wang, Baiyu Chen, Kun Yan et al.
With the rise in popularity of smart glasses, users' attention has been integrated into Vision-Language Models (VLMs) to streamline multi-modal querying in daily scenarios. However, leveraging gaze data to model users' attention may introduce ambiguity challenges: (1) users' verbal questions become ambiguous by using pronouns or skipping context, (2) humans' gaze patterns can be noisy and exhibit complex spatiotemporal relationships with their spoken questions. Previous works only consider single image as visual modality input, failing to capture the dynamic nature of the user's attention. In this work, we introduce GLARIFY, a novel method to leverage spatiotemporal gaze information to enhance the model's effectiveness in real-world applications. Initially, we analyzed hundreds of querying samples with the gaze modality to demonstrate the noisy nature of users' gaze patterns. We then utilized GPT-4o to design an automatic data synthesis pipeline to generate the GLARIFY-Ambi dataset, which includes a dedicated chain-of-thought (CoT) process to handle noisy gaze patterns. Finally, we designed a heatmap module to incorporate gaze information into cutting-edge VLMs while preserving their pretrained knowledge. We evaluated GLARIFY using a hold-out test set. Experiments demonstrate that GLARIFY significantly outperforms baselines. By robustly aligning VLMs with human attention, GLARIFY paves the way for a usable and intuitive interaction paradigm with a visual assistant.
NISep 25, 2025
Trustworthy Semantic Communication for Vehicular Networks: Challenges and SolutionsYanghe Pan, Yuntao Wang, Shaolong Guo et al.
Semantic communication (SemCom) has the potential to significantly reduce communication delay in vehicle-to-everything (V2X) communications within vehicular networks (VNs). However, the deployment of vehicular SemCom networks (VN-SemComNets) faces critical trust challenges in information transmission, semantic encoding, and communication entity reliability. This paper proposes an innovative three-layer trustworthy VN-SemComNet architecture. Specifically, we introduce a semantic camouflage transmission mechanism leveraging defensive adversarial noise for active eavesdropping defense, a robust federated encoder-decoder training framework to mitigate encoder-decoder poisoning attacks, and an audit game-based distributed vehicle trust management mechanism to deter untrustworthy vehicles. A case study validates the effectiveness of the proposed solutions. Lastly, essential future research directions are pointed out to advance this emerging field.
AISep 11, 2025
Enabling Regulatory Multi-Agent Collaboration: Architecture, Challenges, and SolutionsQinnan Hu, Yuntao Wang, Yuan Gao et al.
Large language models (LLMs)-empowered autonomous agents are transforming both digital and physical environments by enabling adaptive, multi-agent collaboration. While these agents offer significant opportunities across domains such as finance, healthcare, and smart manufacturing, their unpredictable behaviors and heterogeneous capabilities pose substantial governance and accountability challenges. In this paper, we propose a blockchain-enabled layered architecture for regulatory agent collaboration, comprising an agent layer, a blockchain data layer, and a regulatory application layer. Within this framework, we design three key modules: (i) an agent behavior tracing and arbitration module for automated accountability, (ii) a dynamic reputation evaluation module for trust assessment in collaborative scenarios, and (iii) a malicious behavior forecasting module for early detection of adversarial activities. Our approach establishes a systematic foundation for trustworthy, resilient, and scalable regulatory mechanisms in large-scale agent ecosystems. Finally, we discuss the future research directions for blockchain-enabled regulatory frameworks in multi-agent systems.
HCFeb 9, 2025
WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on SmartwatchYing Lei, Yancheng Cao, Will Wang et al.
While just-in-time interventions (JITIs) have effectively targeted common health behaviors, individuals often have unique needs to intervene in personal undesirable actions that can negatively affect physical, mental, and social well-being. We present WatchGuardian, a smartwatch-based JITI system that empowers users to define custom interventions for these personal actions with a small number of samples. For the model to detect new actions based on limited new data samples, we developed a few-shot learning pipeline that finetuned a pre-trained inertial measurement unit (IMU) model on public hand-gesture datasets. We then designed a data augmentation and synthesis process to train additional classification layers for customization. Our offline evaluation with 26 participants showed that with three, five, and ten examples, our approach achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of 74.8%, 84.2%, and 87.2%, respectively. We then conducted a four-hour intervention study to compare WatchGuardian against a rule-based intervention. Our results demonstrated that our system led to a significant reduction of 64.0 ± 22.6% in undesirable actions, substantially outperforming the baseline by 29.0%. Our findings underscore the effectiveness of a customizable, AI-driven JITI system for individuals in need of behavioral intervention in personal undesirable actions. We envision that our work can inspire broader applications of user-defined personalized intervention with advanced AI solutions.
CYMay 25, 2023
A Survey on ChatGPT: AI-Generated Contents, Challenges, and SolutionsYuntao Wang, Yanghe Pan, Miao Yan et al.
With the widespread use of large artificial intelligence (AI) models such as ChatGPT, AI-generated content (AIGC) has garnered increasing attention and is leading a paradigm shift in content creation and knowledge representation. AIGC uses generative large AI algorithms to assist or replace humans in creating massive, high-quality, and human-like content at a faster pace and lower cost, based on user-provided prompts. Despite the recent significant progress in AIGC, security, privacy, ethical, and legal challenges still need to be addressed. This paper presents an in-depth survey of working principles, security and privacy threats, state-of-the-art solutions, and future challenges of the AIGC paradigm. Specifically, we first explore the enabling technologies, general architecture of AIGC, and discuss its working modes and key characteristics. Then, we investigate the taxonomy of security and privacy threats to AIGC and highlight the ethical and societal implications of GPT and AIGC technologies. Furthermore, we review the state-of-the-art AIGC watermarking approaches for regulatable AIGC paradigms regarding the AIGC model and its produced content. Finally, we identify future challenges and open research directions related to AIGC.
CVMay 7, 2023
Camera-Based HRV Prediction for Remote Learning EnvironmentsKegang Wang, Yantao Wei, Jiankai Tang et al.
In recent years, due to the widespread use of internet videos, remote photoplethysmography (rPPG) has gained increasing attention in the field of affective computing. Restoring blood volume pulse (BVP) signals from facial videos is a challenging task that involves a series of preprocessing, image algorithm, and postprocessing steps to restore waveforms. Not only is the heart rate metric utilized for affective computing, but the heart rate variability (HRV) metric is even more significant. The challenge in obtaining HRV indices through rPPG lies in the necessity for algorithms to precisely predict the BVP peak positions. In this paper, we collected the Remote Learning Affect and Physiology (RLAP) dataset, which includes over 32 hours of highly synchronized video and labels from 58 subjects. This is a public dataset whose BVP labels have been meticulously designed to better suit the training of HRV models. Using the RLAP dataset, we trained a new model called Seq-rPPG, which is based on one-dimensional convolution. Experimental results reveal that this structure is more suitable for handling HRV tasks: it outperformed all other baselines in HRV performance and also demonstrated significant advantages in computational efficiency.
CVJan 11, 2022
MobilePhys: Personalized Mobile Camera-Based Contactless Physiological SensingXin Liu, Yuntao Wang, Sinan Xie et al.
Camera-based contactless photoplethysmography refers to a set of popular techniques for contactless physiological measurement. The current state-of-the-art neural models are typically trained in a supervised manner using videos accompanied by gold standard physiological measurements. However, they often generalize poorly to out-of-domain examples (i.e., videos that are unlike those in the training set). Personalizing models can help improve model generalizability, but many personalization techniques still require some gold standard data. To help alleviate this dependency, in this paper, we present a novel mobile sensing system called MobilePhys, the first mobile personalized remote physiological sensing system, that leverages both front and rear cameras on a smartphone to generate high-quality self-supervised labels for training personalized contactless camera-based PPG models. To evaluate the robustness of MobilePhys, we conducted a user study with 39 participants who completed a set of tasks under different mobile devices, lighting conditions/intensities, motion tasks, and skin types. Our results show that MobilePhys significantly outperforms the state-of-the-art on-device supervised training and few-shot adaptation methods. Through extensive user studies, we further examine how MobilePhys performs in complex real-world settings. We envision that calibrated or personalized camera-based contactless PPG models generated from our proposed dual-camera mobile sensing system will open the door for numerous future applications such as smart mirrors, fitness, and mobile health applications.
SDDec 24, 2021
Enabling Real-time On-chip Audio Super Resolution for Bone Conduction MicrophonesYuang Li, Yuntao Wang, Xin Liu et al.
Voice communication using the air conduction microphone in noisy environments suffers from the degradation of speech audibility. Bone conduction microphones (BCM) are robust against ambient noise but suffer from limited effective bandwidth due to their sensing mechanism. Although existing audio super resolution algorithms can recover the high frequency loss to achieve high-fidelity audio, they require considerably more computational resources than are available in low-power hearable devices. This paper proposes the first-ever real-time on-chip speech audio super resolution system for BCM. To accomplish this, we built and compared a series of lightweight audio super resolution deep learning models. Among all these models, ATS-UNet is the most cost-efficient because the proposed novel Audio Temporal Shift Module (ATSM) reduces the network's dimensionality while maintaining sufficient temporal features from speech audio. Then we quantized and deployed the ATS-UNet to low-end ARM micro-controller units for real-time embedded prototypes. Evaluation results show that our system achieved real-time inference speed on Cortex-M7 and higher quality than the baseline audio super resolution method. Finally, we conducted a user study with ten experts and ten amateur listeners to evaluate our method's effectiveness for human listeners. Both groups perceived a significantly higher speech quality with our method when compared to the solutions with the original BCM or air conduction microphone with cutting-edge noise reduction algorithms.
HCJun 2, 2021
Understanding the Design Space of Mouth MicrogesturesVictor Chen, Xuhai Xu, Richard Li et al.
As wearable devices move toward the face (e.g., smart earbuds, glasses), there is an increasing need to facilitate intuitive interactions with these devices. Current sensing techniques can already detect many mouth-based gestures; however, users' preferences for these gestures are not fully understood. In this paper, we investigate the design space and usability of mouth-based microgestures. We first conducted brainstorming sessions (N=16) and compiled an extensive set of 86 user-defined gestures. Then, with an online survey (N=50), we assessed the physical and mental demand of our gesture set and identified a subset of 14 gestures that can be performed easily and naturally. Finally, we conducted a remote Wizard-of-Oz usability study (N=11) mapping gestures to various daily smartphone operations in sitting and walking contexts. From these studies, we develop a taxonomy for mouth gestures, finalize a practical gesture set for common applications, and provide design guidelines for future mouth-based gesture interactions.