Jiyao Wang

CV
h-index23
20papers
107citations
Novelty47%
AI Score54

20 Papers

LGMay 27
Learning Robust and Task-Invariant Functional Representation from fMRI through Siamese Self-Supervised Learning

Jiyao Wang, Peiyu Duan, Nicha C. Dvornek et al.

Functional magnetic resonance imaging (fMRI) is a powerful tool for investigating human brain function. However, the high cost of data acquisition and the inherent subjectivity of psychiatric rating scales often lead to datasets with small sample sizes and variable label quality, especially when targeting a specific neurological condition. Combined with the inherently high dimensionality of fMRI data, these limitations substantially increase the risk of model overfitting. Recent years have seen growing interest in developing fMRI foundation models by combining multiple datasets; however, the computational resources needed for pretraining and fine-tuning are often prohibitive. We show that a lightweight self-supervised framework yields representations that generalize across diverse downstream tasks, outperforming fully supervised baselines and approaching the performance of large-scale models. We introduce BrainSimSiam, a data-efficient self-supervised representation learning framework that leverages positive-only data pairs to learn robust and generalizable features. We demonstrate that the learned representations achieve strong performance across multiple downstream classification and regression tasks, highlighting the potential of BrainSimSiam for data-limited neuroimaging applications.

IVAug 29, 2023
Learning Sequential Information in Task-based fMRI for Synthetic Data Augmentation

Jiyao Wang, Nicha C. Dvornek, Lawrence H. Staib et al.

Insufficiency of training data is a persistent issue in medical image analysis, especially for task-based functional magnetic resonance images (fMRI) with spatio-temporal imaging data acquired using specific cognitive tasks. In this paper, we propose an approach for generating synthetic fMRI sequences that can then be used to create augmented training datasets in downstream learning tasks. To synthesize high-resolution task-specific fMRI, we adapt the $α$-GAN structure, leveraging advantages of both GAN and variational autoencoder models, and propose different alternatives in aggregating temporal information. The synthetic images are evaluated from multiple perspectives including visualizations and an autism spectrum disorder (ASD) classification task. The results show that the synthetic task-based fMRI can provide effective data augmentation in learning the ASD classification task.

LGMay 26
FM-fMRI: Event Conditioned Flow Matching for Rest-to-Task fMRI Time-Series Synthesis

Peiyu Duan, Jiyao Wang, Nicha C. Dvornek et al.

Task-based fMRI provides a direct readout of task-evoked neural dynamics, but it is expensive and difficult to acquire at scale, motivating rest-to-task synthesis from widely available resting-state fMRI (rsfMRI). We propose FM-fMRI, an event-conditioned flow-matching model that learns a continuous-time conditional vector field to generate task ROI time series from a subject's rsfMRI and the task event information. The formulation enables fast ODE-based sampling and flexible conditioning over heterogeneous event schedules. Rather than optimizing for pointwise reconstruction, we evaluated generated signals using complementary criteria that probe temporal and spectral structure, subject and group-level connectome consistency, and distributional alignment. On the public Human Connectome Project and internal BioPoint autism cohort, FM-fMRI achieves the strongest spectral and connectivity agreement and improved distribution-level matching over conditional diffusion, generative adversarial networks (GANs), and variational autoencoders (VAEs) baselines. Furthermore, we augment the BioPoint cohort by synthesizing task-fMRI ROI time series with our method, improving downstream autism classification and demonstrating practical utility in data-limited clinical settings. The code will be available on GitHub.

AIMay 2
MILD: Mediator Agent System with Bidirectional Perception and Multi-Layered Alignment for Human-Vehicle Collaboration

Jiyao Wang, Yunbiao Wang, Yubo Jiao et al.

Prior studies report that partial driving automation can increase the cognitive demands on human drivers. This effect largely arises from human drivers' lack of transparent insight into the vehicle's intentions and decision logic, as well as from automated systems' limited awareness of the driver's dynamic state and preferences. This bidirectional misalignment undermines shared situational awareness and exacerbates coordination failures in human-vehicle interaction. To address these limitations, we argue for a paradigm shift that elevates the human role from passive supervisor to active manager. We introduce the Mediator-in-the-Loop-Driving (MILD) system, based on an agentic system architecture to facilitate synergistic human-vehicle collaboration. MILD integrates a perception agent for joint in-cabin and out-of-cabin understanding with a lightweight strategy agent that generates compliant and explainable action suggestions. To ensure these strategies are strictly aligned with safety regulations and human values, we develop Evidence- and Constraint-weighted Policy Optimization (ECPO). ECPO leverages automatic validators to steer the agent toward behaviors that are not only accurate but also structurally complete, substantiated by evidence, and free from constraint violations. Furthermore, a retrieval-augmented generation module dynamically incorporates constraints from traffic regulations, speed recommendations, and driver preferences into the decision loop. Field experiments across three open datasets demonstrate that MILD consistently outperforms baselines in both perception accuracy and strategy quality under auditable offline metrics, and yields higher human-rated policy adequacy, comfort, and explanation than baselines. This work offers a practical pathway for building auditable and aligned agents for human-vehicle collaborative driving.

CYMar 19
Agentic Vehicles for Human-Centered Mobility: Definition, Prospects, and System Implications

Jiangbo Yu, Raphael Frank, Luis Miranda-Moreno et al.

Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Autonomous vehicles (AuVs) are therefore understood as systems that perceive their environment and execute pre-programmed tasks independently of external input, consistent with the SAE levels of automated driving. Yet recent research and real-world deployments have begun to showcase vehicles that exhibit behaviors outside the scope of this definition. These include natural language interaction with humans, goal adaptation, contextual reasoning, external tool use, and the handling of unforeseen ethical dilemmas, enabled in part by multimodal large language models (LLMs). These developments highlight not only a gap between technical autonomy and the broader cognitive and social capacities required for human-centered mobility, but also the emergence of a form of vehicle intelligence that currently lacks a clear designation. To address this gap, the paper introduces the concept of agentic vehicles (AgVs): vehicles that exhibit agency, the capacity for goal-driven reasoning, strategic adaptation, self-reflection, and purposeful engagement with complex environments. We conclude by outlining key challenges in the development and governance of AgVs and their potential role in shaping future agentic transportation systems that align with user and societal needs.

HCApr 20
From Awareness to Intent: Mitigating Silent Driving System Failures through Prospective Situation Awareness Enhancing Interfaces

Jiyao Wang, Song Yan, Xiao Yang et al.

Silent automation failures, where a system fails to detect a hazard without warning, pose a critical safety challenge for partially automated vehicles. While research has mostly focused on takeover requests, how to support a driver in silent failure remains underexplored. We conducted a multi-modal driving simulator study with 48 participants to investigate how different Prospective Situation Awareness Enhancement (PSAE) interfaces, delivered via augmented reality head-up display, affect takeover performance. By integrating behavioral, subjective psychological, and physiological data, our analysis suggests that situational awareness (SA) serves as an important moderating factor through which PSAE interfaces improve takeover performance. Further, we found that providing perceptual cues was most effective in enhancing SA, while communicating system intent was superior for building trust. Finally, we identified a potential correlate of SA in the neuroactivity. Overall, this paper contributes to understanding how transparency-oriented interfaces may support drivers and provides design insights into HMI design for silent failures.

CVMar 9, 2024Code
GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing

Hao Lu, Xuesong Niu, Jiyao Wang et al.

Multimodal large language models (MLLMs) are designed to process and integrate information from multiple sources, such as text, speech, images, and videos. Despite its success in language understanding, it is critical to evaluate the performance of downstream tasks for better human-centric applications. This paper assesses the application of MLLMs with 5 crucial abilities for affective computing, spanning from visual affective tasks and reasoning tasks. The results show that \gpt has high accuracy in facial action unit recognition and micro-expression detection while its general facial expression recognition performance is not accurate. We also highlight the challenges of achieving fine-grained micro-expression recognition and the potential for further study and demonstrate the versatility and potential of \gpt for handling advanced tasks in emotion recognition and related fields by integrating with task-related agents for more complex tasks, such as heart rate estimation through signal processing. In conclusion, this paper provides valuable insights into the potential applications and challenges of MLLMs in human-centric computing. Our interesting examples are at https://github.com/EnVision-Research/GPT4Affectivity.

AIJul 25, 2025Code
PhysDrive: A Multimodal Remote Physiological Measurement Dataset for In-vehicle Driver Monitoring

Jiyao Wang, Xiao Yang, Qingyong Hu et al. · tsinghua

Robust and unobtrusive in-vehicle physiological monitoring is crucial for ensuring driving safety and user experience. While remote physiological measurement (RPM) offers a promising non-invasive solution, its translation to real-world driving scenarios is critically constrained by the scarcity of comprehensive datasets. Existing resources are often limited in scale, modality diversity, the breadth of biometric annotations, and the range of captured conditions, thereby omitting inherent real-world challenges in driving. Here, we present PhysDrive, the first large-scale multimodal dataset for contactless in-vehicle physiological sensing with dedicated consideration on various modality settings and driving factors. PhysDrive collects data from 48 drivers, including synchronized RGB, near-infrared camera, and raw mmWave radar data, accompanied with six synchronized ground truths (ECG, BVP, Respiration, HR, RR, and SpO2). It covers a wide spectrum of naturalistic driving conditions, including driver motions, dynamic natural light, vehicle types, and road conditions. We extensively evaluate both signal-processing and deep-learning methods on PhysDrive, establishing a comprehensive benchmark across all modalities, and release full open-source code with compatibility for mainstream public toolboxes. We envision PhysDrive will serve as a foundational resource and accelerate research on multimodal driver monitoring and smart-cockpit systems.

CVMar 2
VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Duoxun Tang, Dasen Dai, Jiyao Wang et al.

Video-LLMs are increasingly deployed in safety-critical applications but are vulnerable to Energy-Latency Attacks (ELAs) that exhaust computational resources. Current image-centric methods fail because temporal aggregation mechanisms dilute individual frame perturbations. Additionally, real-time demands make instance-wise optimization impractical for continuous video streams. We introduce VidDoS, which is the first universal ELA framework tailored for Video-LLMs. Our method leverages universal optimization to create instance-agnostic triggers that require no inference-time gradient calculation. We achieve this through $\textit{masked teacher forcing}$ to steer models toward expensive target sequences, combined with a $\textit{refusal penalty}$ and $\textit{early-termination suppression}$ to override conciseness priors. Testing across three mainstream Video-LLMs and three video datasets, which include video question answering and autonomous driving scenarios, shows extreme degradation. VidDoS induces a token expansion of more than 205$\times$ and inflates the inference latency by more than 15$\times$ relative to clean baselines. Simulations of real-time autonomous driving streams further reveal that this induced latency leads to critical safety violations. We urge the community to recognize and mitigate these high-hazard ELA in Video-LLMs.

CVMay 10, 2024
PhysMLE: Generalizable and Priors-Inclusive Multi-task Remote Physiological Measurement

Jiyao Wang, Hao Lu, Ange Wang et al.

Remote photoplethysmography (rPPG) has been widely applied to measure heart rate from face videos. To increase the generalizability of the algorithms, domain generalization (DG) attracted increasing attention in rPPG. However, when rPPG is extended to simultaneously measure more vital signs (e.g., respiration and blood oxygen saturation), achieving generalizability brings new challenges. Although partial features shared among different physiological signals can benefit multi-task learning, the sparse and imbalanced target label space brings the seesaw effect over task-specific feature learning. To resolve this problem, we designed an end-to-end Mixture of Low-rank Experts for multi-task remote Physiological measurement (PhysMLE), which is based on multiple low-rank experts with a novel router mechanism, thereby enabling the model to adeptly handle both specifications and correlations within tasks. Additionally, we introduced prior knowledge from physiology among tasks to overcome the imbalance of label space under real-world multi-task physiological measurement. For fair and comprehensive evaluations, this paper proposed a large-scale multi-task generalization benchmark, named Multi-Source Synsemantic Domain Generalization (MSSDG) protocol. Extensive experiments with MSSDG and intra-dataset have shown the effectiveness and efficiency of PhysMLE. In addition, a new dataset was collected and made publicly available to meet the needs of the MSSDG.

CVOct 28, 2024
Efficient Mixture-of-Expert for Video-based Driver State and Physiological Multi-task Estimation in Conditional Autonomous Driving

Jiyao Wang, Xiao Yang, Zhenyu Wang et al.

Road safety remains a critical challenge worldwide, with approximately 1.35 million fatalities annually attributed to traffic accidents, often due to human errors. As we advance towards higher levels of vehicle automation, challenges still exist, as driving with automation can cognitively over-demand drivers if they engage in non-driving-related tasks (NDRTs), or lead to drowsiness if driving was the sole task. This calls for the urgent need for an effective Driver Monitoring System (DMS) that can evaluate cognitive load and drowsiness in SAE Level-2/3 autonomous driving contexts. In this study, we propose a novel multi-task DMS, termed VDMoE, which leverages RGB video input to monitor driver states non-invasively. By utilizing key facial features to minimize computational load and integrating remote Photoplethysmography (rPPG) for physiological insights, our approach enhances detection accuracy while maintaining efficiency. Additionally, we optimize the Mixture-of-Experts (MoE) framework to accommodate multi-modal inputs and improve performance across different tasks. A novel prior-inclusive regularization method is introduced to align model outputs with statistical priors, thus accelerating convergence and mitigating overfitting risks. We validate our method with the creation of a new dataset (MCDD), which comprises RGB video and physiological indicators from 42 participants, and two public datasets. Our findings demonstrate the effectiveness of VDMoE in monitoring driver states, contributing to safer autonomous driving systems. The code and data will be released.

CVJun 19, 2025
Align the GAP: Prior-based Unified Multi-Task Remote Physiological Measurement Framework For Domain Generalization and Personalization

Jiyao Wang, Xiao Yang, Hao Lu et al.

Multi-source synsemantic domain generalization (MSSDG) for multi-task remote physiological measurement seeks to enhance the generalizability of these metrics and attracts increasing attention. However, challenges like partial labeling and environmental noise may disrupt task-specific accuracy. Meanwhile, given that real-time adaptation is necessary for personalized products, the test-time personalized adaptation (TTPA) after MSSDG is also worth exploring, while the gap between previous generalization and personalization methods is significant and hard to fuse. Thus, we proposed a unified framework for MSSD\textbf{G} and TTP\textbf{A} employing \textbf{P}riors (\textbf{GAP}) in biometrics and remote photoplethysmography (rPPG). We first disentangled information from face videos into invariant semantics, individual bias, and noise. Then, multiple modules incorporating priors and our observations were applied in different stages and for different facial information. Then, based on the different principles of achieving generalization and personalization, our framework could simultaneously address MSSDG and TTPA under multi-task remote physiological estimation with minimal adjustments. We expanded the MSSDG benchmark to the TTPA protocol on six publicly available datasets and introduced a new real-world driving dataset with complete labeling. Extensive experiments that validated our approach, and the codes along with the new dataset will be released.

CVJul 10, 2025
Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement

Xiao Yang, Jiyao Wang, Yuxuan Fan et al.

Remote physiological measurement (RPM) has emerged as a promising non-invasive method for monitoring physiological signals using the non-contact device. Although various domain adaptation and generalization methods were proposed to promote the adaptability of deep-based RPM models in unseen deployment environments, considerations in aspects such as privacy concerns and real-time adaptation restrict their application in real-world deployment. Thus, we aim to propose a novel fully Test-Time Adaptation (TTA) strategy tailored for RPM tasks in this work. Specifically, based on prior knowledge in physiology and our observations, we noticed not only there is spatio-temporal consistency in the frequency domain of BVP signals, but also that inconsistency in the time domain was significant. Given this, by leveraging both consistency and inconsistency priors, we introduce an innovative expert knowledge-based self-supervised \textbf{C}onsistency-\textbf{i}n\textbf{C}onsistency-\textbf{i}ntegration (\textbf{CiCi}) framework to enhances model adaptation during inference. Besides, our approach further incorporates a gradient dynamic control mechanism to mitigate potential conflicts between priors, ensuring stable adaptation across instances. Through extensive experiments on five diverse datasets under the TTA protocol, our method consistently outperforms existing techniques, presenting state-of-the-art performance in real-time self-supervised adaptation without accessing source data. The code will be released later.

SPJun 3, 2025
Towards Generalizable Drowsiness Monitoring with Physiological Sensors: A Preliminary Study

Jiyao Wang, Suzan Ayas, Jiahao Zhang et al.

Accurately detecting drowsiness is vital to driving safety. Among all measures, physiological-signal-based drowsiness monitoring can be more privacy-preserving than a camera-based approach. However, conflicts exist regarding how physiological metrics are associated with different drowsiness labels across datasets. Thus, we analyzed key features from electrocardiograms (ECG), electrodermal activity (EDA), and respiratory (RESP) signals across four datasets, where different drowsiness inducers (such as fatigue and low arousal) and assessment methods (subjective vs. objective) were used. Binary logistic regression models were built to identify the physiological metrics that are associated with drowsiness. Findings indicate that distinct different drowsiness inducers can lead to different physiological responses, and objective assessments were more sensitive than subjective ones in detecting drowsiness. Further, the increased heart rate stability, reduced respiratory amplitude, and decreased tonic EDA are robustly associated with increased drowsiness. The results enhance understanding of drowsiness detection and can inform future generalizable monitoring designs.

CVOct 21, 2025
FeatureFool: Zero-Query Fooling of Video Models via Feature Map

Duoxun Tang, Xi Xiao, Guangwu Hu et al.

The vulnerability of deep neural networks (DNNs) has been preliminarily verified. Existing black-box adversarial attacks usually require multi-round interaction with the model and consume numerous queries, which is impractical in the real-world and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the clean-video feature space. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that utilizes information extracted from a DNN to alter the feature space of clean videos. Unlike query-based methods that rely on iterative interaction, FeatureFool performs a zero-query attack by directly exploiting DNN-extracted information. This efficient approach is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70\% against traditional video classifiers without any queries. Benefiting from the transferability of the feature map, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.

CVMay 15, 2025
Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot

Hao Lu, Jiaqi Tang, Jiyao Wang et al.

The intelligent driving cockpit, an important part of intelligent driving, needs to match different users' comfort, interaction, and safety needs. This paper aims to build a Super-Aligned and GEneralist DRiving agent, SAGE DeeR. Sage Deer achieves three highlights: (1) Super alignment: It achieves different reactions according to different people's preferences and biases. (2) Generalist: It can understand the multi-view and multi-mode inputs to reason the user's physiological indicators, facial emotions, hand movements, body movements, driving scenarios, and behavioral decisions. (3) Self-Eliciting: It can elicit implicit thought chains in the language space to further increase generalist and super-aligned abilities. Besides, we collected multiple data sets and built a large-scale benchmark. This benchmark measures the deer's perceptual decision-making ability and the super alignment's accuracy.

CVOct 14, 2025
FedHUG: Federated Heterogeneous Unsupervised Generalization for Remote Physiological Measurements

Xiao Yang, Dengbo He, Jiyao Wang et al.

Remote physiological measurement gained wide attention, while it requires collecting users' privacy-sensitive information, and existing contactless measurements still rely on labeled client data. This presents challenges when we want to further update real-world deployed models with numerous user data lacking labels. To resolve these challenges, we instantiate a new protocol called Federated Unsupervised Domain Generalization (FUDG) in this work. Subsequently, the \textbf{Fed}erated \textbf{H}eterogeneous \textbf{U}nsupervised \textbf{G}eneralization (\textbf{FedHUG}) framework is proposed and consists of: (1) Minimal Bias Aggregation module dynamically adjusts aggregation weights based on prior-driven bias evaluation to cope with heterogeneous non-IID features from multiple domains. (2) The Global Distribution-aware Learning Controller parameterizes the label distribution and dynamically manipulates client-specific training strategies, thereby mitigating the server-client label distribution skew and long-tail issue. The proposal shows superior performance across state-of-the-art techniques in estimation with either RGB video or mmWave radar. The code will be released.

IVFeb 15, 2025
Towards Zero-Shot Task-Generalizable Learning on fMRI

Jiyao Wang, Nicha C. Dvornek, Peiyu Duan et al.

Functional MRI measuring BOLD signal is an increasingly important imaging modality in studying brain functions and neurological disorders. It can be acquired in either a resting-state or a task-based paradigm. Compared to resting-state fMRI, task-based fMRI is acquired while the subject is performing a specific task designed to enhance study-related brain activities. Consequently, it generally has more informative task-dependent signals. However, due to the variety of task designs, it is much more difficult than in resting state to aggregate task-based fMRI acquired in different tasks to train a generalizable model. To resolve this complication, we propose a supervised task-aware network TA-GAT that jointly learns a general-purpose encoder and task-specific contextual information. The encoder-generated embedding and the learned contextual information are then combined as input to multiple modules for performing downstream tasks. We believe that the proposed task-aware architecture can plug-and-play in any neural network architecture to incorporate the prior knowledge of fMRI tasks into capturing functional brain patterns.

AIDec 23, 2024
STAHGNet: Modeling Hybrid-grained Heterogenous Dependency Efficiently for Traffic Prediction

Jiyao Wang, Zehua Peng, Yijia Zhang et al.

Traffic flow prediction plays a critical role in the intelligent transportation system, and it is also a challenging task because of the underlying complex Spatio-temporal patterns and heterogeneities evolving across time. However, most present works mostly concentrate on solely capturing Spatial-temporal dependency or extracting implicit similarity graphs, but the hybrid-granularity evolution is ignored in their modeling process. In this paper, we proposed a novel data-driven end-to-end framework, named Spatio-Temporal Aware Hybrid Graph Network (STAHGNet), to couple the hybrid-grained heterogeneous correlations in series simultaneously through an elaborately Hybrid Graph Attention Module (HGAT) and Coarse-granularity Temporal Graph (CTG) generator. Furthermore, an automotive feature engineering with domain knowledge and a random neighbor sampling strategy is utilized to improve efficiency and reduce computational complexity. The MAE, RMSE, and MAPE are used for evaluation metrics. Tested on four real-life datasets, our proposal outperforms eight classical baselines and four state-of-the-art (SOTA) methods (e.g., MAE 14.82 on PeMSD3; MAE 18.92 on PeMSD4). Besides, extensive experiments and visualizations verify the effectiveness of each component in STAHGNet. In terms of computational cost, STAHGNet saves at least four times the space compared to the previous SOTA models. The proposed model will be beneficial for more efficient TFP as well as intelligent transport system construction.

LGJun 17, 2024
STNAGNN: Data-driven Spatio-temporal Brain Connectivity beyond FC

Jiyao Wang, Nicha C. Dvornek, Peiyu Duan et al.

In recent years, graph neural networks (GNNs) have been widely applied in the analysis of brain fMRI, yet defining the connectivity between ROIs remains a challenge in noisy fMRI data. Among all approaches, Functional Connectome (FC) is the most popular method. Computed by the correlation coefficients between ROI time series, FC is a powerful and computationally efficient way to estimate ROI connectivity. However, it is well known for neglecting structural connections and causality in ROI interactions. Also, FC becomes much more noisy in the short spatio-temporal sliding-window subsequences of fMRI. Effective Connectome (EC) is proposed as a directional alternative, but is difficult to accurately estimate. Furthermore, for optimal GNN performance, usually only a small percentage of the strongest connections are selected as sparse edges, resulting in oversimplification of complex brain connections. To tackle these challenges, we propose the Spatio-Temporal Node Attention Graph Neural Network (STNAGNN) as a data-driven alternative that combines sparse predefined FC with dense data-driven spatio-temporal connections, allowing for flexible and spatio-temporal learning of ROI interaction patterns.