CVSep 18, 2024Code
SFDA-rPPG: Source-Free Domain Adaptive Remote Physiological Measurement with Spatio-Temporal ConsistencyYiping Xie, Zitong Yu, Bingjie Wu et al.
Remote Photoplethysmography (rPPG) is a non-contact method that uses facial video to predict changes in blood volume, enabling physiological metrics measurement. Traditional rPPG models often struggle with poor generalization capacity in unseen domains. Current solutions to this problem is to improve its generalization in the target domain through Domain Generalization (DG) or Domain Adaptation (DA). However, both traditional methods require access to both source domain data and target domain data, which cannot be implemented in scenarios with limited access to source data, and another issue is the privacy of accessing source domain data. In this paper, we propose the first Source-free Domain Adaptation benchmark for rPPG measurement (SFDA-rPPG), which overcomes these limitations by enabling effective domain adaptation without access to source domain data. Our framework incorporates a Three-Branch Spatio-Temporal Consistency Network (TSTC-Net) to enhance feature consistency across domains. Furthermore, we propose a new rPPG distribution alignment loss based on the Frequency-domain Wasserstein Distance (FWD), which leverages optimal transport to align power spectrum distributions across domains effectively and further enforces the alignment of the three branches. Extensive cross-domain experiments and ablation studies demonstrate the effectiveness of our proposed method in source-free domain adaptation settings. Our findings highlight the significant contribution of the proposed FWD loss for distributional alignment, providing a valuable reference for future research and applications. The source code is available at https://github.com/XieYiping66/SFDA-rPPG
CVSep 18, 2024Code
PhysMamba: Efficient Remote Physiological Measurement with SlowFast Temporal Difference MambaChaoqi Luo, Yiping Xie, Zitong Yu
Facial-video based Remote photoplethysmography (rPPG) aims at measuring physiological signals and monitoring heart activity without any contact, showing significant potential in various applications. Previous deep learning based rPPG measurement are primarily based on CNNs and Transformers. However, the limited receptive fields of CNNs restrict their ability to capture long-range spatio-temporal dependencies, while Transformers also struggle with modeling long video sequences with high complexity. Recently, the state space models (SSMs) represented by Mamba are known for their impressive performance on capturing long-range dependencies from long sequences. In this paper, we propose the PhysMamba, a Mamba-based framework, to efficiently represent long-range physiological dependencies from facial videos. Specifically, we introduce the Temporal Difference Mamba block to first enhance local dynamic differences and further model the long-range spatio-temporal context. Moreover, a dual-stream SlowFast architecture is utilized to fuse the multi-scale temporal features. Extensive experiments are conducted on three benchmark datasets to demonstrate the superiority and efficiency of PhysMamba. The codes are available at https://github.com/Chaoqi31/PhysMamba
87.1CVMar 31Code
SecAgent: Efficient Mobile GUI Agent with Semantic ContextYiping Xie, Song Chen, Jingxuan Xing et al.
Mobile Graphical User Interface (GUI) agents powered by multimodal large language models have demonstrated promising capabilities in automating complex smartphone tasks. However, existing approaches face two critical limitations: the scarcity of high-quality multilingual datasets, particularly for non-English ecosystems, and inefficient history representation methods. To address these challenges, we present SecAgent, an efficient mobile GUI agent at 3B scale. We first construct a human-verified Chinese mobile GUI dataset with 18k grounding samples and 121k navigation steps across 44 applications, along with a Chinese navigation benchmark featuring multi-choice action annotations. Building upon this dataset, we propose a semantic context mechanism that distills history screenshots and actions into concise, natural language summaries, significantly reducing computational costs while preserving task-relevant information. Through supervised and reinforcement fine-tuning, SecAgent outperforms similar-scale baselines and achieves performance comparable to 7B-8B models on our and public navigation benchmarks. Our dataset is available at https://huggingface.co/datasets/alibabagroup/CMGUI.
ROJun 15, 2022
Neural Network Normal Estimation and Bathymetry Reconstruction from Sidescan SonarYiping Xie, Nils Bore, John Folkesson
Sidescan sonar intensity encodes information about the changes of surface normal of the seabed. However, other factors such as seabed geometry as well as its material composition also affect the return intensity. One can model these intensity changes in a forward direction from the surface normals from bathymetric map and physical properties to the measured intensity or alternatively one can use an inverse model which starts from the intensities and models the surface normals. Here we use an inverse model which leverages deep learning's ability to learn from data; a convolutional neural network is used to estimate the surface normal from the sidescan. Thus the internal properties of the seabed are only implicitly learned. Once this information is estimated, a bathymetric map can be reconstructed through an optimization framework that also includes altimeter readings to provide a sparse depth profile as a constraint. Implicit neural representation learning was recently proposed to represent the bathymetric map in such an optimization framework. In this article, we use a neural network to represent the map and optimize it under constraints of altimeter points and estimated surface normal from sidescan. By fusing multiple observations from different angles from several sidescan lines, the estimated results are improved through optimization. We demonstrate the efficiency and scalability of the approach by reconstructing a high-quality bathymetry using sidescan data from a large sidescan survey. We compare the proposed data-driven inverse model approach of modeling a sidescan with a forward Lambertian model. We assess the quality of each reconstruction by comparing it with data constructed from a multibeam sensor.
ROJun 15, 2022
High-Resolution Bathymetric Reconstruction From Sidescan Sonar With Deep Neural NetworksYiping Xie, Nils Bore, John Folkesson
We propose a novel data-driven approach for high-resolution bathymetric reconstruction from sidescan. Sidescan sonar (SSS) intensities as a function of range do contain some information about the slope of the seabed. However, that information must be inferred. Additionally, the navigation system provides the estimated trajectory, and normally the altitude along this trajectory is also available. From these we obtain a very coarse seabed bathymetry as an input. This is then combined with the indirect but high-resolution seabed slope information from the sidescan to estimate the full bathymetry. This sparse depth could be acquired by single-beam echo sounder, Doppler Velocity Log (DVL), other bottom tracking sensors or bottom tracking algorithm from sidescan itself. In our work, a fully convolutional network is used to estimate the depth contour and its aleatoric uncertainty from the sidescan images and sparse depth in an end-to-end fashion. The estimated depth is then used together with the range to calculate the point's 3D location on the seafloor. A high-quality bathymetric map can be reconstructed after fusing the depth predictions and the corresponding confidence measures from the neural networks. We show the improvement of the bathymetric map gained by using sparse depths with sidescan over estimates with sidescan alone. We also show the benefit of confidence weighting when fusing multiple bathymetric estimates into a single map.
CVApr 18, 2023
Evaluation of a Canonical Image Representation for Sidescan SonarWeiqi Xu, Li Ling, Yiping Xie et al.
Acoustic sensors play an important role in autonomous underwater vehicles (AUVs). Sidescan sonar (SSS) detects a wide range and provides photo-realistic images in high resolution. However, SSS projects the 3D seafloor to 2D images, which are distorted by the AUV's altitude, target's range and sensor's resolution. As a result, the same physical area can show significant visual differences in SSS images from different survey lines, causing difficulties in tasks such as pixel correspondence and template matching. In this paper, a canonical transformation method consisting of intensity correction and slant range correction is proposed to decrease the above distortion. The intensity correction includes beam pattern correction and incident angle correction using three different Lambertian laws (cos, cos2, cot), whereas the slant range correction removes the nadir zone and projects the position of SSS elements into equally horizontally spaced, view-point independent bins. The proposed method is evaluated on real data collected by a HUGIN AUV, with manually-annotated pixel correspondence as ground truth reference. Experimental results on patch pairs compare similarity measures and keypoint descriptor matching. The results show that the canonical transformation can improve the patch similarity, as well as SIFT descriptor matching accuracy in different images where the same physical area was ensonified.
LGSep 20, 2024
Score-Based Multibeam Point Cloud DenoisingLi Ling, Yiping Xie, Nils Bore et al.
Multibeam echo-sounder (MBES) is the de-facto sensor for bathymetry mapping. In recent years, cheaper MBES sensors and global mapping initiatives have led to exponential growth of available data. However, raw MBES data contains 1-25% of noise that requires semi-automatic filtering using tools such as Combined Uncertainty and Bathymetric Estimator (CUBE). In this work, we draw inspirations from the 3D point cloud community and adapted a score-based point cloud denoising network for MBES outlier detection and denoising. We trained and evaluated this network on real MBES survey data. The proposed method was found to outperform classical methods, and can be readily integrated into existing MBES standard workflow. To facilitate future research, the code and pretrained model are available online.
CVFeb 19, 2025Code
CardiacMamba: A Multimodal RGB-RF Fusion Framework with State Space Models for Remote Physiological MeasurementZheng Wu, Yiping Xie, Bo Zhao et al.
Heart rate (HR) estimation via remote photoplethysmography (rPPG) offers a non-invasive solution for health monitoring. However, traditional single-modality approaches (RGB or Radio Frequency (RF)) face challenges in balancing robustness and accuracy due to lighting variations, motion artifacts, and skin tone bias. In this paper, we propose CardiacMamba, a multimodal RGB-RF fusion framework that leverages the complementary strengths of both modalities. It introduces the Temporal Difference Mamba Module (TDMM) to capture dynamic changes in RF signals using timing differences between frames, enhancing the extraction of local and global features. Additionally, CardiacMamba employs a Bidirectional SSM for cross-modal alignment and a Channel-wise Fast Fourier Transform (CFFT) to effectively capture and refine the frequency domain characteristics of RGB and RF signals, ultimately improving heart rate estimation accuracy and periodicity detection. Extensive experiments on the EquiPleth dataset demonstrate state-of-the-art performance, achieving marked improvements in accuracy and robustness. CardiacMamba significantly mitigates skin tone bias, reducing performance disparities across demographic groups, and maintains resilience under missing-modality scenarios. By addressing critical challenges in fairness, adaptability, and precision, the framework advances rPPG technology toward reliable real-world deployment in healthcare. The codes are available at: https://github.com/WuZheng42/CardiacMamba.
CVMay 6, 2025
PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological SensingYiping Xie, Bo Zhao, Mingtong Dai et al.
Remote photoplethysmography (rPPG) enables non-contact physiological measurement but remains highly susceptible to illumination changes, motion artifacts, and limited temporal modeling. Large Language Models (LLMs) excel at capturing long-range dependencies, offering a potential solution but struggle with the continuous, noise-sensitive nature of rPPG signals due to their text-centric design. To bridge this gap, we introduce PhysLLM, a collaborative optimization framework that synergizes LLMs with domain-specific rPPG components. Specifically, the Text Prototype Guidance (TPG) strategy is proposed to establish cross-modal alignment by projecting hemodynamic features into LLM-interpretable semantic space, effectively bridging the representational gap between physiological signals and linguistic tokens. Besides, a novel Dual-Domain Stationary (DDS) Algorithm is proposed for resolving signal instability through adaptive time-frequency domain feature re-weighting. Finally, rPPG task-specific cues systematically inject physiological priors through physiological statistics, environmental contextual answering, and task description, leveraging cross-modal learning to integrate both visual and textual information, enabling dynamic adaptation to challenging scenarios like variable illumination and subject movements. Evaluation on four benchmark datasets, PhysLLM achieves state-of-the-art accuracy and robustness, demonstrating superior generalization across lighting variations and motion scenarios.
CVFeb 6, 2025
Semi-rPPG: Semi-Supervised Remote Physiological Measurement with Curriculum Pseudo-LabelingBingjie Wu, Zitong Yu, Yiping Xie et al.
Remote Photoplethysmography (rPPG) is a promising technique to monitor physiological signals such as heart rate from facial videos. However, the labeled facial videos in this research are challenging to collect. Current rPPG research is mainly based on several small public datasets collected in simple environments, which limits the generalization and scale of the AI models. Semi-supervised methods that leverage a small amount of labeled data and abundant unlabeled data can fill this gap for rPPG learning. In this study, a novel semi-supervised learning method named Semi-rPPG that combines curriculum pseudo-labeling and consistency regularization is proposed to extract intrinsic physiological features from unlabelled data without impairing the model from noises. Specifically, a curriculum pseudo-labeling strategy with signal-to-noise ratio (SNR) criteria is proposed to annotate the unlabelled data while adaptively filtering out the low-quality unlabelled data. Besides, a novel consistency regularization term for quasi-periodic signals is proposed through weak and strong augmented clips. To benefit the research on semi-supervised rPPG measurement, we establish a novel semi-supervised benchmark for rPPG learning through intra-dataset and cross-dataset evaluation on four public datasets. The proposed Semi-rPPG method achieves the best results compared with three classical semi-supervised methods under different protocols. Ablation studies are conducted to prove the effectiveness of the proposed methods.
LGApr 13, 2021
Optimizing the Long-Term Average Reward for Continuing MDPs: A Technical ReportChao Xu, Yiping Xie, Xijun Wang et al.
Recently, we have struck the balance between the information freshness, in terms of age of information (AoI), experienced by users and energy consumed by sensors, by appropriately activating sensors to update their current status in caching enabled Internet of Things (IoT) networks [1]. To solve this problem, we cast the corresponding status update procedure as a continuing Markov Decision Process (MDP) (i.e., without termination states), where the number of state-action pairs increases exponentially with respect to the number of considered sensors and users. Moreover, to circumvent the curse of dimensionality, we have established a methodology for designing deep reinforcement learning (DRL) algorithms to maximize (resp. minimize) the average reward (resp. cost), by integrating R-learning, a tabular reinforcement learning (RL) algorithm tailored for maximizing the long-term average reward, and traditional DRL algorithms, initially developed to optimize the discounted long-term cumulative reward rather than the average one. In this technical report, we would present detailed discussions on the technical contributions of this methodology.