CV · Apr 10, 2022
CholecTriplet2021: A benchmark challenge for surgical action triplet recognition
Chinedu Innocent Nwoye, Deepak Alapatt, Tong Yu et al.
Context-aware decision support in the operating room can foster surgical safety and efficiency by leveraging real-time feedback from surgical workflow analysis. Most existing works recognize surgical activities at a coarse-grained level, such as phases, steps or events, leaving out fine-grained interaction details about the surgical activity; yet those are needed for more helpful AI assistance in the operating room. Recognizing surgical actions as triplets of <instrument, verb, target> combinations delivers comprehensive details about the activities taking place in surgical videos. This paper presents CholecTriplet2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos. The challenge granted private access to the large-scale CholecT50 dataset, which is annotated with action triplet information. In this paper, we present the challenge setup and an assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge. A total of 4 baseline methods from the challenge organizers and 19 new deep learning algorithms by competing teams are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%. This study also analyzes the significance of the results obtained by the presented approaches, performs a thorough methodological comparison and an in-depth result analysis, and proposes a novel ensemble method for enhanced recognition. Our analysis shows that surgical workflow analysis is not yet solved, and also highlights interesting directions for future research on fine-grained surgical activity recognition, which is of utmost importance for the development of AI in surgery.
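For readers unfamiliar with the mAP figures quoted in the abstract above, the following is a minimal sketch of how mean average precision can be computed for a multi-label task such as triplet recognition. It is an illustration under stated assumptions, not the challenge's official evaluation code: the function names, the per-frame score and label layout, and the choice to skip classes with no positive frames are all hypothetical.

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision values at each rank
    where a positive frame appears (frames sorted by descending score)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positive frames has undefined AP (assumption: skip it).
    return sum(precisions) / len(precisions) if precisions else None


def mean_average_precision(score_matrix, label_matrix):
    """score_matrix[i][c] is the predicted score of class c on frame i;
    label_matrix[i][c] is 1 if class c is present on frame i, else 0."""
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        ap = average_precision(
            [row[c] for row in score_matrix],
            [row[c] for row in label_matrix],
        )
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps)


# Three frames, two hypothetical triplet classes.
scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]
labels = [[1, 0], [0, 1], [1, 1]]
print(mean_average_precision(scores, labels))
```

Averaging per-class AP values gives every triplet class equal weight regardless of how often it occurs, which is one reason mAP is commonly reported for long-tailed label sets.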
CV · Oct 27, 2022
Facial Video-based Remote Physiological Measurement via Self-supervised Learning
Zijie Yue, Miaojing Shi, Shuai Ding
Facial video-based remote physiological measurement aims to estimate remote photoplethysmography (rPPG) signals from human face videos and then measure multiple vital signs (e.g. heart rate, respiration frequency) from rPPG signals. Recent approaches achieve this by training deep neural networks, which normally require abundant facial videos and synchronously recorded photoplethysmography (PPG) signals for supervision. However, the collection of these annotated corpora is not easy in practice. In this paper, we introduce a novel frequency-inspired self-supervised framework that learns to estimate rPPG signals from facial videos without the need for ground truth PPG signals. Given a video sample, we first augment it into multiple positive/negative samples which contain similar/dissimilar signal frequencies to the original one. Specifically, positive samples are generated using spatial augmentation. Negative samples are generated via a learnable frequency augmentation module, which performs non-linear signal frequency transformation on the input without excessively changing its visual appearance. Next, we introduce a local rPPG expert aggregation module to estimate rPPG signals from augmented samples. It encodes complementary pulsation information from different face regions and aggregates it into one rPPG prediction. Finally, we propose a series of frequency-inspired losses, i.e. frequency contrastive loss, frequency ratio consistency loss, and cross-video frequency agreement loss, for the optimization of estimated rPPG signals from multiple augmented video samples and across temporally neighboring video samples. We conduct rPPG-based heart rate, heart rate variability and respiration frequency estimation on four standard benchmarks. The experimental results demonstrate that our method improves the state of the art by a large margin.
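The abstract above treats the dominant frequency of an rPPG signal as the quantity of interest. As a hedged illustration of that general idea (not the authors' framework), the sketch below estimates heart rate from an already-extracted rPPG trace by scanning a plausible pulse band for the frequency with the highest spectral power. The function names, the 0.7 to 4.0 Hz band, and the 0.01 Hz grid are assumptions chosen for the example.

```python
import math


def dominant_frequency_hz(signal, fs, f_min=0.7, f_max=4.0, step=0.01):
    """Return the frequency (Hz) in [f_min, f_max] with the largest
    spectral power, using a direct Fourier sum on a frequency grid."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC component
    best_f, best_power = f_min, -1.0
    num_steps = int(round((f_max - f_min) / step))
    for k in range(num_steps + 1):
        f = f_min + k * step
        re = sum(x * math.cos(2 * math.pi * f * i / fs) for i, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * f * i / fs) for i, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
    return best_f


def heart_rate_bpm(signal, fs):
    """Convert the dominant pulse frequency to beats per minute."""
    return 60.0 * dominant_frequency_hz(signal, fs)


# Synthetic 10-second trace sampled at 30 fps with a 1.2 Hz pulse.
fs = 30
trace = [math.sin(2 * math.pi * 1.2 * i / fs) for i in range(10 * fs)]
print(heart_rate_bpm(trace, fs))
```

Restricting the search to a physiological band is a common way to keep slow illumination drift and high-frequency noise from being mistaken for the pulse.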
NI · Aug 15, 2024
System States Forecasting of Microservices with Dynamic Spatio-Temporal Data
Yifei Xu, Jingguo Ge, Haina Tang et al.
In the AIOps (Artificial Intelligence for IT Operations) era, accurately forecasting system states is crucial. In microservices systems, this task encounters the challenge of dynamic and complex spatio-temporal relationships among microservice instances, primarily due to dynamic deployments, diverse call paths, and cascading effects among instances. Current time-series forecasting methods, which focus mainly on intrinsic patterns, are insufficient in environments where spatial relationships are critical. Similarly, spatio-temporal graph approaches often neglect the nature of temporal trends, concentrating mostly on message passing between nodes. Moreover, current research in the microservices domain frequently underestimates the importance of network metrics and topological structures in capturing the evolving dynamics of systems. This paper introduces STMformer, a model tailored for forecasting system states in microservices environments, capable of handling multi-node and multivariate time series. Our method leverages dynamic network connection data and topological information to assist in modeling the intricate spatio-temporal relationships within the system. Additionally, we integrate the PatchCrossAttention module to compute the impact of cascading effects globally. We have developed a dataset based on a microservices system and conducted comprehensive experiments with STMformer against leading methods. In both short-term and long-term forecasting tasks, our model consistently achieved an 8.6% reduction in MAE (Mean Absolute Error) and a 2.2% reduction in MSE (Mean Squared Error). The source code is available at https://github.com/xuyifeiiie/STMformer.
IR · Apr 18, 2023
Integrity and Junkiness Failure Handling for Embedding-based Retrieval: A Case Study in Social Network Search
Wenping Wang, Yunxi Guo, Chiyao Shen et al.
Embedding-based retrieval is used in a variety of search applications, such as e-commerce and social network search. While the approach has demonstrated its efficacy in tasks like semantic matching and contextual search, it is plagued by the problem of uncontrollable relevance. In this paper, we conduct an analysis of embedding-based retrieval launched in early 2021 on our social network search engine, and define two main categories of failures introduced by it: integrity and junkiness. The former refers to issues such as hate speech and offensive content that can severely harm user experience, while the latter includes irrelevant results like fuzzy text matching or language mismatches. We further propose efficient methods applied during model inference to resolve these issues, including indexing treatments and targeted user cohort treatments. Though simple, the methods yield gains in offline NDCG and online A/B test metrics in practice. We analyze the reasons for the improvements, pointing out that our methods are only preliminary attempts at this important but challenging problem. We put forward potential future directions to explore.
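The offline metric named in the abstract above, NDCG, rewards rankings that place highly relevant results near the top. A minimal sketch of one common formulation follows. The abstract does not say which NDCG variant the authors used, so the linear gain, the log2 discount, and the function names here are assumptions for illustration only.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    rels = relevances[:k] if k is not None else relevances
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))


def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Graded relevance of the results in the order the engine returned them.
print(ndcg([3, 2, 1]))  # already ideally ordered
print(ndcg([1, 3]))     # best result ranked second
```

Because the score is normalized by the ideal ordering, a junk result inserted near the top of a list lowers NDCG more than the same result placed near the bottom.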
CV · Jul 11, 2024
Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement
Zijie Yue, Miaojing Shi, Hanli Wang et al.
Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches mostly rely on supervised learning, requiring extensive collections of facial videos and synchronously recorded photoplethysmography (PPG) signals. To avoid this requirement, self-supervised learning has recently gained attention; however, its performance is limited by the lack of ground truth PPG signals. In this paper, we propose a novel self-supervised framework that successfully integrates the popular vision-language models (VLMs) into the remote physiological measurement task. Given a facial video, we first augment its positive and negative video samples with varying rPPG signal frequencies. Next, we introduce a frequency-oriented vision-text pair generation method by carefully creating contrastive spatio-temporal maps from positive and negative samples and designing proper text prompts to describe their relative ratios of signal frequencies. A pre-trained VLM is employed to extract features for these formed vision-text pairs and estimate rPPG signals thereafter. We develop a series of generative and contrastive learning mechanisms to optimize the VLM, including the text-guided visual map reconstruction task, the vision-text contrastive learning task, and the frequency contrastive and ranking task. Overall, our method for the first time adapts VLMs to digest and align the frequency-related knowledge in vision and text modalities. Extensive experiments on four benchmark datasets demonstrate that it significantly outperforms state-of-the-art self-supervised methods.
LG · Mar 16, 2023
A Multimodal Data-driven Framework for Anxiety Screening
Haimiao Mo, Shuai Ding, Siu Cheung Hui
Early screening for anxiety and appropriate interventions are essential to reduce the incidence of self-harm and suicide in patients. Due to limited medical resources, traditional methods that overly rely on physician expertise and specialized equipment cannot simultaneously meet the needs for high accuracy and model interpretability. Multimodal data can provide more objective evidence for anxiety screening to improve the accuracy of models. However, the large amount of noise in multimodal data and the unbalanced nature of the data make the model prone to overfitting. Moreover, selecting combinations of high-dimensional, multimodal features as model inputs is a non-differentiable problem when it is incorporated into model training. This makes existing anxiety screening methods based on machine learning and deep learning inapplicable. Therefore, we propose a multimodal data-driven anxiety screening framework, namely MMD-AS, and conduct experiments on health data collected via smartphones from over 200 seafarers. The proposed framework's feature extraction, dimension reduction, feature selection, and anxiety inference are jointly trained to improve the model's performance. In the feature selection step, a feature selection method based on the Improved Fireworks Algorithm is used to solve the non-differentiable problem of feature combination, removing redundant features and searching for the ideal feature subset. The experimental results show that our framework outperforms the comparison methods.
CV · Aug 12, 2022
SFF-DA: Spatiotemporal Feature Fusion for Detecting Anxiety Nonintrusively
Haimiao Mo, Yuchen Li, Shanlin Yang et al.
Early detection of anxiety is crucial for reducing the suffering of individuals with mental disorders and improving treatment outcomes. Utilizing an mHealth platform for anxiety screening can be particularly practical in improving screening efficiency and reducing costs. However, the effectiveness of existing methods has been hindered by differences in mobile devices used to capture subjects' physical and mental evaluations, as well as by the variability in data quality and small sample size problems encountered in real-world settings. To address these issues, we propose a framework with spatiotemporal feature fusion for detecting anxiety nonintrusively. We use a feature extraction network based on a 3D convolutional network and long short-term memory ("3DCNN+LSTM") to fuse the spatiotemporal features of facial behavior and noncontact physiology, which reduces the impact of uneven data quality. Additionally, we design a similarity assessment strategy to address the issue of deteriorating model accuracy due to small sample sizes. Our framework is validated with a crew dataset from the real world and two public datasets: the University of Burgundy Franche-Comté Psychophysiological (UBFC-Phys) dataset and the Smart Reasoning for Well-being at Home and at Work for Knowledge Work (SWELL-KW) dataset. The experimental results indicate that our framework outperforms the comparison methods.
CV · May 8, 2024
Self-supervised Gait-based Emotion Representation Learning from Selective Strongly Augmented Skeleton Sequences
Cheng Song, Lu Lu, Zhen Ke et al.
Emotion recognition is an important part of affective computing. Extracting emotional cues from human gaits yields benefits such as natural interaction, a nonintrusive nature, and remote detection. Recently, the introduction of self-supervised learning techniques offers a practical solution to the issues arising from the scarcity of labeled data in the field of gait-based emotion recognition. However, due to the limited diversity of gaits and the incompleteness of feature representations for skeletons, the existing contrastive learning methods are usually inefficient for the acquisition of gait emotions. In this paper, we propose a contrastive learning framework utilizing selective strong augmentation (SSA) for self-supervised gait-based emotion representation, which aims to derive effective representations from limited labeled gait data. First, we propose an SSA method for the gait emotion recognition task, which includes upper body jitter and random spatiotemporal mask. The goal of SSA is to generate more diverse and targeted positive samples and prompt the model to learn more distinctive and robust feature representations. Then, we design a complementary feature fusion network (CFFN) that facilitates the integration of cross-domain information to acquire topological structural and global adaptive features. Finally, we implement the distributional divergence minimization loss to supervise the representation learning of the generally and strongly augmented queries. Our approach is validated on the Emotion-Gait (E-Gait) and Emilya datasets and outperforms the state-of-the-art methods under different evaluation protocols.
RO · Mar 24, 2021
Single-Shot is Enough: Panoramic Infrastructure Based Calibration of Multiple Cameras and 3D LiDARs
Chuan Fang, Shuai Ding, Zilong Dong et al.
The integration of multiple cameras and 3D LiDARs has become a basic configuration of augmented reality devices, robotics, and autonomous vehicles. The calibration of multi-modal sensors is crucial for a system to function properly, but it remains tedious and impractical for mass production. Moreover, most devices require re-calibration after a certain period of use. In this paper, we propose a single-shot solution for calibrating extrinsic transformations among multiple cameras and 3D LiDARs. We establish a panoramic infrastructure, in which a camera or LiDAR can be robustly localized using data from a single frame. Experiments are conducted on three devices with different camera-LiDAR configurations, showing that our approach achieves calibration accuracy comparable to the state-of-the-art approaches but with much greater efficiency.
RO · Nov 17, 2020
Collaborative Three-Tier Architecture Non-contact Respiratory Rate Monitoring using Target Tracking and False Peaks Eliminating Algorithms
Haimiao Mo, Shuai Ding, Shanlin Yang et al.
Monitoring the respiratory rate is crucial for identifying respiratory disorders. Devices for conventional respiratory monitoring are inconvenient and scarcely available. Recent research has demonstrated the ability of non-contact technologies, such as photoplethysmography and infrared thermography, to gather respiratory signals from the face and monitor breathing. However, current non-contact respiratory monitoring techniques have poor accuracy because they are sensitive to environmental influences like lighting and motion artifacts. Furthermore, frequent communication between users and the cloud in real-world medical application settings might cause service request delays and potentially the loss of personal data. We propose a non-contact respiratory rate monitoring system with a collaborative three-tier design to increase the precision of respiratory monitoring and decrease data transmission latency. To reduce data transmission and network latency, our three-tier architecture decomposes the computing tasks of respiration monitoring layer by layer. Moreover, we improve the accuracy of respiratory monitoring by designing a target tracking algorithm and an algorithm for eliminating false peaks to extract high-quality respiratory signals. By gathering data from several regions of interest on the face, we extract the respiration signal and investigate how different regions affect the monitoring of respiration. The experimental results indicate that extracting the respiratory signal from the nasal region performs best. Our approach performs better than competing approaches while transferring less data.
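The abstract above mentions eliminating false peaks before estimating the respiratory rate but does not describe the algorithm. The sketch below shows one generic way such a step can work, assuming a height threshold and a minimum spacing between breaths. It is not the authors' method, and the function names and the default values of `min_interval_s` and `min_height` are hypothetical.

```python
import math


def find_peaks(signal):
    """Indices of simple local maxima."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i - 1] < signal[i] >= signal[i + 1]]


def eliminate_false_peaks(signal, peaks, fs, min_interval_s=1.5, min_height=0.3):
    """Drop peaks that are too low, and merge peaks that fall closer together
    than min_interval_s by keeping only the taller one."""
    kept = []
    for i in peaks:
        if signal[i] < min_height:
            continue  # too small to be a breath (assumed threshold)
        if kept and (i - kept[-1]) / fs < min_interval_s:
            if signal[i] > signal[kept[-1]]:
                kept[-1] = i  # keep the taller of two nearby peaks
            continue
        kept.append(i)
    return kept


def respiratory_rate_bpm(signal, fs):
    """Breaths per minute from the spacing of the surviving peaks."""
    peaks = eliminate_false_peaks(signal, find_peaks(signal), fs)
    if len(peaks) < 2:
        return None
    duration_s = (peaks[-1] - peaks[0]) / fs
    return 60.0 * (len(peaks) - 1) / duration_s


# Synthetic 60-second trace: 0.25 Hz breathing plus a small 3 Hz ripple
# that creates spurious local maxima around each crest and trough.
fs = 10
trace = [math.sin(2 * math.pi * 0.25 * i / fs)
         + 0.05 * math.sin(2 * math.pi * 3.0 * i / fs)
         for i in range(60 * fs)]
print(respiratory_rate_bpm(trace, fs))
```

The minimum spacing encodes a physiological prior: breaths rarely occur less than a second or so apart, so closely spaced maxima are more likely noise than separate breaths.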