Fuji Ren

CV
h-index21
22papers
327citations
Novelty43%
AI Score55

22 Papers

SDDec 14, 2022
Disentangling Prosody Representations with Unsupervised Speech Reconstruction

Leyuan Qu, Taihao Li, Cornelius Weber et al.

Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks respectively. However, it is still an open challenging research question to extract prosodic information because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust large-scale and speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model to generate speaker identity embeddings, and (3) a trainable prosody encoder to learn prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and unweighted accuracies) and subjective (mean opinion score) evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary and beneficial for the performance of widely used speech pretraining models and surpass the state-of-the-art methods when combining Prosody2Vec with HuBERT representations.

AIDec 17, 2022
Graph Learning and Its Advancements on Large Language Models: A Holistic Survey

Shaopeng Wei, Jun Wang, Yu Zhao et al.

Graph learning is a prevalent domain that endeavors to learn the intricate relationships among nodes and the topological structure of graphs. Over the years, graph learning has transcended from graph theory to graph data mining. With the advent of representation learning, it has attained remarkable performance in diverse scenarios. Owing to its extensive application prospects, graph learning attracts copious attention. While some researchers have accomplished impressive surveys on graph learning, they failed to connect related objectives, methods, and applications in a more coherent way. As a result, they did not encompass current ample scenarios and challenging problems due to the rapid expansion of graph learning. Particularly, large language models have recently had a disruptive effect on human life, but they also show relative weakness in structured scenarios. The question of how to make these models more powerful with graph learning remains open. Our survey focuses on the most recent advancements in integrating graph learning with pre-trained language models, specifically emphasizing their application within the domain of large language models. Different from previous surveys on graph learning, we provide a holistic review that analyzes current works from the perspective of graph structure, and discusses the latest applications, trends, and challenges in graph learning. Specifically, we commence by proposing a taxonomy and then summarize the methods employed in graph learning. We then provide a detailed elucidation of mainstream applications. Finally, we propose future directions.

RMNov 28, 2022
A Comprehensive Survey on Enterprise Financial Risk Analysis from Big Data Perspective

Huaming Du, Xingyan Chen, Yu Zhao et al.

Enterprise financial risk analysis aims at predicting the future financial risk of enterprises. Due to its wide and significant application, enterprise financial risk analysis has always been the core research topic in the fields of Finance and Management. Based on advanced computer science and artificial intelligence technologies, enterprise risk analysis research is experiencing rapid developments and making significant progress. Therefore, it is both necessary and challenging to comprehensively review the relevant studies. Although there are already some valuable and impressive surveys on enterprise risk analysis from the perspective of Finance and Management, these surveys introduce approaches in a relatively isolated way and lack recent advances in enterprise financial risk analysis. In contrast, this paper attempts to provide a systematic literature survey of enterprise risk analysis approaches from Big Data perspective, which reviews more than 250 representative articles in the past almost 50 years (from 1968 to 2023). To the best of our knowledge, this is the first and only survey work on enterprise financial risk from Big Data perspective. Specifically, this survey connects and systematizes the existing enterprise financial risk studies, i.e. to summarize and interpret the problems, methods, and spotlights in a comprehensive way. In particular, we first introduce the issues of enterprise financial risks in terms of their types,granularity, intelligence, and evaluation metrics, and summarize the corresponding representative works. Then, we compare the analysis methods used to learn enterprise financial risk, and finally summarize the spotlights of the most representative works. Our goal is to clarify current cutting-edge research and its possible future directions to model enterprise risk, aiming to fully understand the mechanisms of enterprise risk generation and contagion.

CLMay 21, 2025Code
Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling

He Hu, Yucheng Zhou, Juzheng Si et al.

Large language models (LLMs) hold significant potential for mental health support, capable of generating empathetic responses and simulating therapeutic conversations. However, existing LLM-based approaches often lack the clinical grounding necessary for real-world psychological counseling, particularly in explicit diagnostic reasoning aligned with standards like the DSM/ICD and incorporating diverse therapeutic modalities beyond basic empathy or single strategies. To address these critical limitations, we propose PsyLLM, the first large language model designed to systematically integrate both diagnostic and therapeutic reasoning for mental health counseling. To develop PsyLLM, we design a novel automated data synthesis pipeline that processes real-world mental health posts collected from Reddit, where users frequently share psychological distress and seek community support. This pipeline processes real-world mental health posts, generates multi-turn dialogue structures, and leverages LLMs guided by international diagnostic standards (e.g., DSM/ICD) and multiple therapeutic frameworks (e.g., CBT, ACT, psychodynamic) to simulate detailed clinical reasoning processes. Rigorous multi-dimensional filtering ensures the generation of high-quality, clinically aligned dialogue data. In addition, we introduce a new benchmark and evaluation protocol, assessing counseling quality across four key dimensions. Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models on this benchmark. The model weights and dataset have been publicly released at https://github.com/Emo-gml/PsyLLM.

LGJul 4, 2024
Generative Technology for Human Emotion Recognition: A Scope Review

Fei Ma, Yucheng Yuan, Yifan Xie et al.

Affective computing stands at the forefront of artificial intelligence (AI), seeking to imbue machines with the ability to comprehend and respond to human emotions. Central to this field is emotion recognition, which endeavors to identify and interpret human emotional states from different modalities, such as speech, facial images, text, and physiological signals. In recent years, important progress has been made in generative models, including Autoencoder, Generative Adversarial Network, Diffusion Model, and Large Language Model. These models, with their powerful data generation capabilities, emerge as pivotal tools in advancing emotion recognition. However, up to now, there remains a paucity of systematic efforts that review generative technology for emotion recognition. This survey aims to bridge the gaps in the existing literature by conducting a comprehensive analysis of over 320 research papers until June 2024. Specifically, this survey will firstly introduce the mathematical principles of different generative models and the commonly used datasets. Subsequently, through a taxonomy, it will provide an in-depth analysis of how generative techniques address emotion recognition based on different modalities in several aspects, including data augmentation, feature extraction, semi-supervised learning, cross-domain, etc. Finally, the review will outline future research directions, emphasizing the potential of generative models to advance the field of emotion recognition and enhance the emotional intelligence of AI systems.

CLJan 25, 2025Code
MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models

Zhongpu Chen, Yinfeng Liu, Long Shi et al.

Large language models (LLMs) are expected to offer structured Markdown responses for the sake of readability in web chatbots (e.g., ChatGPT). Although there are a myriad of metrics to evaluate LLMs, they fail to evaluate the readability from the view of output content structure. To this end, we focus on an overlooked yet important metric -- Markdown Awareness, which directly impacts the readability and structure of the content generated by these language models. In this paper, we introduce MDEval, a comprehensive benchmark to assess Markdown Awareness for LLMs, by constructing a dataset with 20K instances covering 10 subjects in English and Chinese. Unlike traditional model-based evaluations, MDEval provides excellent interpretability by combining model-based generation tasks and statistical methods. Our results demonstrate that MDEval achieves a Spearman correlation of 0.791 and an accuracy of 84.1% with human, outperforming existing methods by a large margin. Extensive experimental results also show that through fine-tuning over our proposed dataset, less performant open-source models are able to achieve comparable performance to GPT-4o in terms of Markdown Awareness. To ensure reproducibility and transparency, MDEval is open sourced at https://github.com/SWUFE-DB-Group/MDEval-Benchmark.

HCApr 21, 2020Code
WiFE: WiFi and Vision based Intelligent Facial-Gesture Emotion Recognition

Yu Gu, Xiang Zhang, Zhi Liu et al.

Emotion is an essential part of Artificial Intelligence (AI) and human mental health. Current emotion recognition research mainly focuses on single modality (e.g., facial expression), while human emotion expressions are multi-modal in nature. In this paper, we propose a hybrid emotion recognition system leveraging two emotion-rich and tightly-coupled modalities, i.e., facial expression and body gesture. However, unbiased and fine-grained facial expression and gesture recognition remain a major problem. To this end, unlike our rivals relying on contact or even invasive sensors, we explore the commodity WiFi signal for device-free and contactless gesture recognition, while adopting a vision-based facial expression. However, there exist two design challenges, i.e., how to improve the sensitivity of WiFi signals and how to process the large-volume, heterogeneous, and non-synchronous data contributed by the two-modalities. For the former, we propose a signal sensitivity enhancement method based on the Rician K factor theory; for the latter, we combine CNN and RNN to mine the high-level features of bi-modal data, and perform a score-level fusion for fine-grained recognition. To evaluate the proposed method, we build a first-of-its-kind Vision-CSI Emotion Database (VCED) and conduct extensive experiments. Empirical results show the superiority of the bi-modality by achieving 83.24\% recognition accuracy for seven emotions, as compared with 66.48% and 66.67% recognition accuracy by gesture-only based solution and facial-only based solution, respectively. The VCED database download link is https://github.com/purpleleaves007/WIFE-Dataset.

84.3CRApr 20
Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation

Wentao Zhang, Yan Zhuang, ZhuHang Zheng et al.

Existing jamming attacks on Retrieval-Augmented Generation (RAG) systems typically induce explicit refusals or denial-of-service behaviors, which are conspicuous and easy to detect. In this work, we formalize a subtler availability threat, termed soft failure, which degrades system utility by inducing fluent and coherent yet non-informative responses rather than overt failures. We propose Deceptive Evolutionary Jamming Attack (DEJA), an automated black-box attack framework that generates adversarial documents to trigger such soft failures by exploiting safety-aligned behaviors of large language models. DEJA employs an evolutionary optimization process guided by a fine-grained Answer Utility Score (AUS), computed via an LLM-based evaluator, to systematically degrade the certainty of answers while maintaining high retrieval success. Extensive experiments across multiple RAG configurations and benchmark datasets show that DEJA consistently drives responses toward low-utility soft failures, achieving SASR above 79\% while keeping hard-failure rates below 15\%, significantly outperforming prior attacks. The resulting adversarial documents exhibit high stealth, evading perplexity-based detection and resisting query paraphrasing, and transfer across model families to proprietary systems without retargeting.

15.0CLApr 3
Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts

Jiawen Deng, Wentao Zhang, Ziyun Jiao et al.

Conversational AI is increasingly deployed in emotionally charged and ethically sensitive interactions. Previous research has primarily concentrated on emotional benchmarks or static safety checks, overlooking how alignment unfolds in evolving conversation. We explore the research question: what breakdowns arise when conversational agents confront emotionally and ethically sensitive behaviors, and how do these affect dialogue quality? To stress-test chatbot performance, we develop a persona-conditioned user simulator capable of engaging in multi-turn dialogue with psychological personas and staged emotional pacing. Our analysis reveals that mainstream models exhibit recurrent breakdowns that intensify as emotional trajectories escalate. We identify several common failure patterns, including affective misalignments, ethical guidance failures, and cross-dimensional trade-offs where empathy supersedes or undermines responsibility. We organize these patterns into a taxonomy and discuss the design implications, highlighting the necessity to maintain ethical coherence and affective sensitivity throughout dynamic interactions. The study offers the HCI community a new perspective on the diagnosis and improvement of conversational AI in value-sensitive and emotionally charged contexts.

15.9MMMay 7
Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition

Yan Zhuang, Minhao Liu, Yanru Zhang et al.

Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading to highly heterogeneous modality combinations and posing significant challenges to achieving consistent and reliable emotion understanding. To address this challenge, we propose the Modality-Aware Contrastive and Uncertainty-Regularized (MCUR) framework, which approaches MER from the perspective of representation consistency, aiming to enable robust emotion prediction across heterogeneous modality combinations. MCUR incorporates two core components: (1) Modality Combination-Based and Category-Based Contrastive Learning mechanism (MCB-CL), which encourages samples with the same emotion category and the same available modalities to be close in the representation space; and (2) Sample-wise Uncertainty-Guided Regularization (SUGR), which adaptively assigns sample-wise uncertain weights to samples to optimize training. Extensive experiments demonstrate that MCUR consistently outperforms existing methods, achieving average F1 gains of 2.2% on MOSI, 2.67% on MOSEI, and 4.37% on IEMOCAP.

IRDec 18, 2024
Bridging the User-side Knowledge Gap in Knowledge-aware Recommendations with Large Language Models

Zheng Hu, Zhe Li, Ziyun Jiao et al.

In recent years, knowledge graphs have been integrated into recommender systems as item-side auxiliary information, enhancing recommendation accuracy. However, constructing and integrating structural user-side knowledge remains a significant challenge due to the improper granularity and inherent scarcity of user-side features. Recent advancements in Large Language Models (LLMs) offer the potential to bridge this gap by leveraging their human behavior understanding and extensive real-world knowledge. Nevertheless, integrating LLM-generated information into recommender systems presents challenges, including the risk of noisy information and the need for additional knowledge transfer. In this paper, we propose an LLM-based user-side knowledge inference method alongside a carefully designed recommendation framework to address these challenges. Our approach employs LLMs to infer user interests based on historical behaviors, integrating this user-side information with item-side and collaborative data to construct a hybrid structure: the Collaborative Interest Knowledge Graph (CIKG). Furthermore, we propose a CIKG-based recommendation framework that includes a user interest reconstruction module and a cross-domain contrastive learning module to mitigate potential noise and facilitate knowledge transfer. We conduct extensive experiments on three real-world datasets to validate the effectiveness of our method. Our approach achieves state-of-the-art performance compared to competitive baselines, particularly for users with sparse interactions.

46.1IRApr 26
Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale

Yongsen Pan, Yuxin Chen, Zheng Hu et al.

Modern recommendation systems involve massive catalogs of multimodal items, where scalable item identification must balance compactness, semantic fidelity, and downstream effectiveness. Semantic IDs (SIDs) address this need by representing items as short discrete token sequences derived from multimodal signals, providing a compact interface for retrieval, ranking, and generative recommendation. However, effective SID learning is hindered by collisions, where different items are assigned identical or highly confusable codes. Existing methods mainly rely on improved quantization or fixed overlap regularization, but they do not adaptively distinguish whether an overlap should be suppressed or preserved. We propose AdaSID, an adaptive semantic ID learning framework for recommendation. AdaSID regulates SID overlaps through a two-stage process. First, it relaxes repulsion for observed overlaps when the involved items are semantically compatible, preserving admissible sharing rather than uniformly separating all collisions. Second, it allocates the remaining regulation pressure according to local collision load and training progress, strengthening control in congested regions while gradually rebalancing optimization toward recommendation alignment. This design adaptively decides which overlaps to penalize, how strongly to regulate them, and when to shift the learning focus. Extensive offline and online experiments validate AdaSID. On two public benchmarks, AdaSID improves Recall and NDCG by about 4.5% on average over strong baselines, while improving codebook utilization and SID diversity. In Kuaishou e-commerce, an online A/B test on short-video retrieval covering tens of millions of users achieves statistically significant gains, including a 0.98% GMV improvement, and industrial ranking evaluation shows consistent AUC improvements.

LGDec 10, 2024
A Review of Human Emotion Synthesis Based on Generative Technology

Fei Ma, Yukan Li, Yifan Xie et al.

Human emotion synthesis is a crucial aspect of affective computing. It involves using computational methods to mimic and convey human emotions through various modalities, with the goal of enabling more natural and effective human-computer interactions. Recent advancements in generative models, such as Autoencoders, Generative Adversarial Networks, Diffusion Models, Large Language Models, and Sequence-to-Sequence Models, have significantly contributed to the development of this field. However, there is a notable lack of comprehensive reviews in this field. To address this problem, this paper aims to address this gap by providing a thorough and systematic overview of recent advancements in human emotion synthesis based on generative models. Specifically, this review will first present the review methodology, the emotion models involved, the mathematical principles of generative models, and the datasets used. Then, the review covers the application of different generative models to emotion synthesis based on a variety of modalities, including facial images, speech, and text. It also examines mainstream evaluation metrics. Additionally, the review presents some major findings and suggests future research directions, providing a comprehensive understanding of the role of generative technology in the nuanced domain of emotion synthesis.

CVApr 12, 2024
MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition

Linhuang Wang, Xin Kang, Fei Ding et al.

Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. Addressing this distinctive attribute, we propose a Multi-Scale Spatio-temporal CNN-Transformer network (MSSTNet). Our approach takes spatial features of different scales extracted by CNN and feeds them into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former simultaneously extracts temporal information while continually integrating multi-scale spatial information. This process culminates in the generation of multi-scale spatio-temporal features that are utilized for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations provide further validation of our approach's proficiency in leveraging spatio-temporal information within DFER.

17.1CLApr 10
EthicMind: A Risk-Aware Framework for Ethical-Emotional Alignment in Multi-Turn Dialogue

Jiawen Deng, Wei Li, Wentao Zhang et al.

Intelligent dialogue systems are increasingly deployed in emotionally and ethically sensitive settings, where failures in either emotional attunement or ethical judgment can cause significant harm. Existing dialogue models typically address empathy and ethical safety in isolation, and often fail to adapt their behavior as ethical risk and user emotion evolve across multi-turn interactions. We formulate ethical-emotional alignment in dialogue as an explicit turn-level decision problem, and propose \textsc{EthicMind}, a risk-aware framework that implements this formulation in multi-turn dialogue at inference time. At each turn, \textsc{EthicMind} jointly analyzes ethical risk signals and user emotion, plans a high-level response strategy, and generates context-sensitive replies that balance ethical guidance with emotional engagement, without requiring additional model training. To evaluate alignment behavior under ethically complex interactions, we introduce a risk-stratified, multi-turn evaluation protocol with a context-aware user simulation procedure. Experimental results show that \textsc{EthicMind} achieves more consistent ethical guidance and emotional engagement than competitive baselines, particularly in high-risk and morally ambiguous scenarios.

CVAug 23, 2025
SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation

Peng Hu, Yu Gu, Liang Luo et al.

Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions, such as text descriptions and initial images. However, a significant challenge persists in this domain: existing models often struggle to maintain strong semantic consistency, frequently generating videos that deviate from the nuanced details specified in the prompts. To address this issue, we propose SSG-DiT (Spatial Signal Guided Diffusion Transformer), a novel and efficient framework for high-fidelity controllable video generation. Our approach introduces a decoupled two-stage process. The first stage, Spatial Signal Prompting, generates a spatially aware visual prompt by leveraging the rich internal representations of a pre-trained multi-modal model. This prompt, combined with the original text, forms a joint condition that is then injected into a frozen video DiT backbone via our lightweight and parameter-efficient SSG-Adapter. This unique design, featuring a dual-branch attention mechanism, allows the model to simultaneously harness its powerful generative priors while being precisely steered by external spatial signals. Extensive experiments demonstrate that SSG-DiT achieves state-of-the-art performance, outperforming existing models on multiple key metrics in the VBench benchmark, particularly in spatial relationship control and overall consistency.

CVMar 14, 2025
Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset

Yibing Weng, Yu Gu, Fuji Ren

Road rage, triggered by driving-related stimuli such as traffic congestion and aggressive driving, poses a significant threat to road safety. Previous research on road rage regulation has primarily focused on response suppression, lacking proactive prevention capabilities. With the advent of Vision-Language Models (VLMs), it has become possible to reason about trigger events visually and then engage in dialog-based comforting before drivers' anger escalates. To this end, we propose the road rage reasoning task, along with a finely annotated test dataset and evaluation metrics, to assess the capabilities of current mainstream VLMs in scene understanding, event recognition, and road rage reasoning. The results indicate that current VLMs exhibit significant shortcomings in scene understanding within the visual modality, as well as in comprehending the spatial relationships between objects in the textual modality. Improving VLMs' performance in these areas will greatly benefit downstream tasks like antecedent-focused road rage regulation.

CVJan 2, 2024
SSP: A Simple and Safe automatic Prompt engineering method towards realistic image synthesis on LVM

Weijin Cheng, Jianzhi Liu, Jiawen Deng et al.

Recently, text-to-image (T2I) synthesis has undergone significant advancements, particularly with the emergence of Large Language Models (LLM) and their enhancement in Large Vision Models (LVM), greatly enhancing the instruction-following capabilities of traditional T2I models. Nevertheless, previous methods focus on improving generation quality but introduce unsafe factors into prompts. We explore that appending specific camera descriptions to prompts can enhance safety performance. Consequently, we propose a simple and safe prompt engineering method (SSP) to improve image generation quality by providing optimal camera descriptions. Specifically, we create a dataset from multi-datasets as original prompts. To select the optimal camera, we design an optimal camera matching approach and implement a classifier for original prompts capable of automatically matching. Appending camera descriptions to original prompts generates optimized prompts for further LVM image generation. Experiments demonstrate that SSP improves semantic consistency by an average of 16% compared to others and safety metrics by 48.9%.

HCAug 27, 2019
EmoSense: Computational Intelligence Driven Emotion Sensing via Wireless Channel Data

Yu Gu, Yantong Wang, Tao Liu et al.

Emotion is well-recognized as a distinguished symbol of human beings, and it plays a crucial role in our daily lives. Existing vision-based or sensor-based solutions are either obstructive to use or rely on specialized hardware, hindering their applicability. This paper introduces EmoSense, a first-of-its-kind wireless emotion sensing system driven by computational intelligence. The basic methodology is to explore the physical expression of emotions from wireless channel response via data mining. The design and implementation of EmoSense {face} two major challenges: extracting physical expression from wireless channel data and recovering emotion from the corresponding physical expression. For the former, we present a Fresnel zone based theoretical model depicting the fingerprint of the physical expression on channel response. For the latter, we design an efficient computational intelligence driven mechanism to recognize emotion from the corresponding fingerprints. We prototyped EmoSense on the commodity WiFi infrastructure and compared it with main-stream sensor-based and vision-based approaches in the real-world scenario. The numerical study over $3360$ cases confirms that EmoSense achieves a comparable performance to the vision-based and sensor-based rivals under different scenarios. EmoSense only leverages the low-cost and prevalent WiFi infrastructures and thus constitutes a tempting solution for emotion sensing.

HCAug 14, 2019
WiFi-based Real-time Breathing and Heart Rate Monitoring during Sleep

Yu Gu, Xiang Zhang, Zhi Liu et al.

Good quality sleep is essential for good health and sleep monitoring becomes a vital research topic. This paper provides a low cost, continuous and contactless WiFi-based vital signs (breathing and heart rate) monitoring method. In particular, we set up the antennas based on Fresnel diffraction model and signal propagation theory, which enhances the detection of weak breathing/heartbeat motion. We implement a prototype system using the off-shelf devices and a real-time processing system to monitor vital signs in real time. The experimental results indicate the accurate breathing rate and heart rate detection performance. To the best of our knowledge, this is the first work to use a pair of WiFi devices and omnidirectional antennas to achieve real-time individual breathing rate and heart rate monitoring in different sleeping postures.

HCJul 13, 2019
BeSense: Leveraging WiFi Channel Data and Computational Intelligence for Behavior Analysis

Yu Gu, Xiang Zhang, Zhi Liu et al.

The ever evolving informatics technology has gradually bounded human and computer in a compact way. Understanding user behavior becomes a key enabler in many fields such as sedentary-related healthcare, human-computer interaction (HCI) and affective computing. Traditional sensor-based and vision-based user behavior analysis approaches are obtrusive in general, hindering their usage in realworld. Therefore, in this article, we first introduce WiFi signal as a new source instead of sensor and vision for unobtrusive user behaviors analysis. Then we design BeSense, a contactless behavior analysis system leveraging signal processing and computational intelligence over WiFi channel state information (CSI). We prototype BeSense on commodity low-cost WiFi devices and evaluate its performance in realworld environments. Experimental results have verified its effectiveness in recognizing user behaviors.

CVNov 29, 2018
Two-level Attention with Two-stage Multi-task Learning for Facial Emotion Recognition

Xiaohua Wang, Muzi Peng, Lijuan Pan et al.

Compared with facial emotion recognition on categorical model, the dimensional emotion recognition can describe numerous emotions of the real world more accurately. Most prior works of dimensional emotion estimation only considered laboratory data and used video, speech or other multi-modal features. The effect of these methods applied on static images in the real world is unknown. In this paper, a two-level attention with two-stage multi-task learning (2Att-2Mt) framework is proposed for facial emotion estimation on only static images. Firstly, the features of corresponding region(position-level features) are extracted and enhanced automatically by first-level attention mechanism. In the following, we utilize Bi-directional Recurrent Neural Network(Bi-RNN) with self-attention(second-level attention) to make full use of the relationship features of different layers(layer-level features) adaptively. Owing to the inherent complexity of dimensional emotion recognition, we propose a two-stage multi-task learning structure to exploited categorical representations to ameliorate the dimensional representations and estimate valence and arousal simultaneously in view of the correlation of the two targets. The quantitative results conducted on AffectNet dataset show significant advancement on Concordance Correlation Coefficient(CCC) and Root Mean Square Error(RMSE), illustrating the superiority of the proposed framework. Besides, extensive comparative experiments have also fully demonstrated the effectiveness of different components.