CV · Feb 28, 2023
HopFIR: Hop-wise GraphFormer with Intragroup Joint Refinement for 3D Human Pose Estimation
Kai Zhai, Qiang Nie, Bo Ouyang et al. · pku
2D-to-3D human pose lifting is fundamental for 3D human pose estimation (HPE), for which graph convolutional networks (GCNs) have proven inherently suitable for modeling the human skeletal topology. However, current GCN-based 3D HPE methods update node features by aggregating information from neighboring nodes, without considering the interaction of joints in different joint synergies. Although some studies have proposed importing limb information to learn movement patterns, the latent synergies among joints, such as those involved in maintaining balance, are seldom investigated. We propose the Hop-wise GraphFormer with Intragroup Joint Refinement (HopFIR) architecture to tackle the 3D HPE problem. HopFIR mainly consists of a novel hop-wise GraphFormer (HGF) module and an intragroup joint refinement (IJR) module. The HGF module groups the joints by k-hop neighbors and applies a hop-wise transformer-like attention mechanism to these groups to discover latent joint synergies. The IJR module leverages prior limb information for peripheral joint refinement. Extensive experimental results show that HopFIR outperforms state-of-the-art (SOTA) methods by a large margin, with a mean per-joint position error (MPJPE) of 32.67 mm on the Human3.6M dataset. We also demonstrate that SOTA GCN-based methods can benefit from the proposed hop-wise attention mechanism with a significant improvement in performance: SemGCN and MGCN are improved by 8.9% and 4.5%, respectively.
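The k-hop grouping step mentioned in the abstract can be illustrated with a breadth-first search over a skeleton graph. The sketch below is not the authors' implementation: the function name `k_hop_groups`, the 7-joint `SKELETON`, and the joint names are all hypothetical, and the attention mechanism applied to the groups is not shown.

```python
from collections import deque


def k_hop_groups(adjacency, source, max_hops):
    """Group joints by shortest-path hop distance from `source` (hops 1..max_hops)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the requested hop radius
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    groups = {k: [] for k in range(1, max_hops + 1)}
    for node, d in dist.items():
        if d >= 1:
            groups[d].append(node)
    return {k: sorted(members) for k, members in groups.items()}


# Hypothetical toy skeleton (undirected, stored as adjacency lists).
SKELETON = {
    "pelvis": ["spine", "l_hip", "r_hip"],
    "spine": ["pelvis", "neck"],
    "neck": ["spine", "head"],
    "head": ["neck"],
    "l_hip": ["pelvis", "l_knee"],
    "l_knee": ["l_hip"],
    "r_hip": ["pelvis"],
}

groups = k_hop_groups(SKELETON, "pelvis", max_hops=2)
print(groups)
```

For the pelvis joint, the 1-hop group holds its direct neighbors and the 2-hop group holds joints two bones away; a hop-wise attention layer would then attend over each group separately.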
CV · Jul 11, 2024
Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement
Zijie Yue, Miaojing Shi, Hanli Wang et al.
Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches mostly rely on supervised learning, which requires extensive collections of facial videos and synchronously recorded photoplethysmography (PPG) signals. To reduce this dependence, self-supervised learning has recently gained attention; however, its performance is limited by the lack of ground-truth PPG signals. In this paper, we propose a novel self-supervised framework that successfully integrates popular vision-language models (VLMs) into the remote physiological measurement task. Given a facial video, we first augment its positive and negative video samples with varying rPPG signal frequencies. Next, we introduce a frequency-oriented vision-text pair generation method by carefully creating contrastive spatio-temporal maps from positive and negative samples and designing proper text prompts to describe their relative ratios of signal frequencies. A pre-trained VLM is employed to extract features for these vision-text pairs and then estimate rPPG signals. We develop a series of generative and contrastive learning mechanisms to optimize the VLM, including the text-guided visual map reconstruction task, the vision-text contrastive learning task, and the frequency contrastive and ranking task. Overall, our method is the first to adapt VLMs to digest and align frequency-related knowledge in the vision and text modalities. Extensive experiments on four benchmark datasets demonstrate that it significantly outperforms state-of-the-art self-supervised methods.
CV · Aug 12, 2022
SFF-DA: Sptialtemporal Feature Fusion for Detecting Anxiety Nonintrusively
Haimiao Mo, Yuchen Li, Shanlin Yang et al.
Early detection of anxiety is crucial for reducing the suffering of individuals with mental disorders and improving treatment outcomes. Utilizing an mHealth platform for anxiety screening can be particularly practical in improving screening efficiency and reducing costs. However, the effectiveness of existing methods has been hindered by differences in mobile devices used to capture subjects' physical and mental evaluations, as well as by the variability in data quality and small sample size problems encountered in real-world settings. To address these issues, we propose a framework with spatiotemporal feature fusion for detecting anxiety nonintrusively. We use a feature extraction network based on a 3D convolutional network and long short-term memory ("3DCNN+LSTM") to fuse the spatiotemporal features of facial behavior and noncontact physiology, which reduces the impact of uneven data quality. Additionally, we design a similarity assessment strategy to address the issue of deteriorating model accuracy due to small sample sizes. Our framework is validated with a crew dataset from the real world and two public datasets: the University of Burgundy Franche-Comté Psychophysiological (UBFC-Phys) dataset and the Smart Reasoning for Well-being at Home and at Work for Knowledge Work (SWELL-KW) dataset. The experimental results indicate that our framework outperforms the comparison methods.
LG · Mar 15, 2025
Cognitive Activation and Chaotic Dynamics in Large Language Models: A Quasi-Lyapunov Analysis of Reasoning Mechanisms
Xiaojian Li, Yongkang Leng, Ruiqing Ding et al.
The human-like reasoning capabilities exhibited by Large Language Models (LLMs) challenge the traditional neural network theory's understanding of the flexibility of fixed-parameter systems. This paper proposes the "Cognitive Activation" theory, revealing the essence of LLMs' reasoning mechanisms from the perspective of dynamic systems: the model's reasoning ability stems from a chaotic process of dynamic information extraction in the parameter space. By introducing the Quasi-Lyapunov Exponent (QLE), we quantitatively analyze the chaotic characteristics of the model at different layers. Experiments show that the model's information accumulation follows a nonlinear exponential law, and the Multilayer Perceptron (MLP) accounts for a higher proportion in the final output than the attention mechanism. Further experiments indicate that minor initial value perturbations will have a substantial impact on the model's reasoning ability, confirming the theoretical analysis that large language models are chaotic systems. This research provides a chaos theory framework for the interpretability of LLMs' reasoning and reveals potential pathways for balancing creativity and reliability in model design.
RO · Nov 17, 2020
Collaborative Three-Tier Architecture Non-contact Respiratory Rate Monitoring using Target Tracking and False Peaks Eliminating Algorithms
Haimiao Mo, Shuai Ding, Shanlin Yang et al.
Monitoring the respiratory rate is crucial for identifying respiratory disorders. Conventional respiratory monitoring devices are inconvenient and scarcely available. Recent research has demonstrated the ability of non-contact technologies, such as photoplethysmography and infrared thermography, to gather respiratory signals from the face and monitor breathing. However, current non-contact respiratory monitoring techniques have poor accuracy because they are sensitive to environmental influences such as lighting and motion artifacts. Furthermore, frequent contact between users and the cloud in real-world medical application settings might cause service request delays and potentially the loss of personal data. We propose a non-contact respiratory rate monitoring system with a collaborative three-tier design to increase the precision of respiratory monitoring and decrease data transmission latency. To reduce data transmission and network latency, our three-tier architecture decomposes the computing tasks of respiration monitoring layer by layer. Moreover, we improve the accuracy of respiratory monitoring by designing a target tracking algorithm and a false-peak elimination algorithm to extract high-quality respiratory signals. By gathering data from several regions of interest on the face, we extract the respiration signal and investigate how different regions affect respiration monitoring. The experimental results indicate that extracting the respiratory signal from the nasal region gives the best performance. Our approach performs better than competing approaches while transferring less data.
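The abstract does not describe how false peaks are eliminated, so the following is only a generic sketch of the idea, not the authors' algorithm: detect local maxima, discard those below an amplitude threshold, merge peaks that fall closer together than a plausible breathing interval, and convert the surviving peaks to breaths per minute. The function names, the synthetic signal, and the thresholds `min_gap` and `min_height` are all assumptions chosen for illustration.

```python
import math


def find_peaks(signal):
    """Indices of simple local maxima."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i - 1] < signal[i] >= signal[i + 1]]


def drop_false_peaks(signal, peaks, min_gap, min_height):
    """Keep peaks above `min_height`; within `min_gap` samples keep only the tallest."""
    kept = []
    for i in peaks:
        if signal[i] < min_height:
            continue  # too small to be a breath
        if kept and i - kept[-1] < min_gap:
            if signal[i] > signal[kept[-1]]:
                kept[-1] = i  # taller peak replaces its close neighbor
            continue
        kept.append(i)
    return kept


def breaths_per_minute(peaks, fs):
    """Respiratory rate from the mean peak-to-peak interval."""
    if len(peaks) < 2:
        return 0.0
    duration_s = (peaks[-1] - peaks[0]) / fs
    return 60.0 * (len(peaks) - 1) / duration_s


# Synthetic test signal (assumed): 15 breaths/min at 10 Hz sampling,
# plus a 2 Hz ripple that creates spurious local maxima.
fs = 10
signal = [math.sin(2 * math.pi * 0.25 * n / fs)
          + 0.2 * math.sin(2 * math.pi * 2.0 * n / fs)
          for n in range(60 * fs)]

raw_peaks = find_peaks(signal)
clean_peaks = drop_false_peaks(signal, raw_peaks, min_gap=2 * fs, min_height=0.5)
rate = breaths_per_minute(clean_peaks, fs)
print(len(raw_peaks), len(clean_peaks), rate)
```

On this synthetic signal the raw detector reports several maxima per breath, while the filtered list keeps one peak per breathing cycle. In practice the thresholds would need tuning to the sensor and the expected breathing range.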