Juan Liu

CV
h-index30
30papers
672citations
Novelty44%
AI Score56

30 Papers

ASJun 4
M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition

Fei Su, Cancan Li, Juan Liu et al.

Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, while real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we present AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.

CLAug 29, 2022
StoryTrans: Non-Parallel Story Author-Style Transfer with Discourse Representations and Content Enhancing

Xuekai Zhu, Jian Guan, Minlie Huang et al. · tsinghua

Non-parallel text style transfer is an important task in natural language generation. However, previous studies concentrate on the token or sentence level, such as sentence sentiment and formality transfer, but neglect long style transfer at the discourse level. Long texts usually involve more complicated author linguistic preferences such as discourse structures than sentences. In this paper, we formulate the task of non-parallel story author-style transfer, which requires transferring an input story into a specified author style while maintaining source semantics. To tackle this problem, we propose a generation model, named StoryTrans, which leverages discourse representations to capture source content information and transfer them to target styles with learnable style embeddings. We use an additional training objective to disentangle stylistic features from the learned discourse representation to prevent the model from degenerating to an auto-encoder. Moreover, to enhance content preservation, we design a mask-and-fill framework to explicitly fuse style-specific keywords of source texts into generation. Furthermore, we constructed new datasets for this task in Chinese and English, respectively. Extensive experiments show that our model outperforms strong baselines in overall performance of style transfer and content preservation.

CVMar 23, 2023
Unsupervised Deep Probabilistic Approach for Partial Point Cloud Registration

Guofeng Mei, Hao Tang, Xiaoshui Huang et al.

Deep point cloud registration methods face challenges to partial overlaps and rely on labeled data. To address these issues, we propose UDPReg, an unsupervised deep probabilistic registration framework for point clouds with partial overlaps. Specifically, we first adopt a network to learn posterior probability distributions of Gaussian mixture models (GMMs) from point clouds. To handle partial point cloud registration, we apply the Sinkhorn algorithm to predict the distribution-level correspondences under the constraint of the mixing weights of GMMs. To enable unsupervised learning, we design three distribution consistency-based losses: self-consistency, cross-consistency, and local contrastive. The self-consistency loss is formulated by encouraging GMMs in Euclidean and feature spaces to share identical posterior distributions. The cross-consistency loss derives from the fact that the points of two partially overlapping point clouds belonging to the same clusters share the cluster centroids. The cross-consistency loss allows the network to flexibly learn a transformation-invariant posterior distribution of two aligned point clouds. The local contrastive loss facilitates the network to extract discriminative local features. Our UDPReg achieves competitive performance on the 3DMatch/3DLoMatch and ModelNet/ModelLoNet benchmarks.

NAApr 7, 2018
Spectral Indicator Method for A Non-selfadjoint Steklov Eigenvalue Problem

Juan Liu, Jiguang Sun, Tiara Turner

We propose an efficient numerical method for a non-selfadjoint Steklov eigenvalue problem. The Lagrange finite element is used for discretization. The convergence is proved using the spectral perturbation theory for compact operators. The non-sefadjointness of the problem leads to non-Hermitian matrix eigenvalue problem. Due to the existence of complex eigenvalues and lack of a priori spectral information, we propose a modified version of the recently developed spectral indicator method to compute (complex) eigenvalues in a given region on the complex plane. In particular, to reduce computational cost, the problem is transformed into a much smaller matrix eigenvalue problem involving the unknowns only on the boundary of the domain. Numerical examples are presented to validate the effectiveness of the proposed method.

ROJul 16, 2024
FR-SLAM: A SLAM Improvement Method Based on Floor Plan Registration

Jiantao Feng, Xinde Li, HyunCheol Park et al.

Simultaneous Localization and Mapping (SLAM) technology enables the construction of environmental maps and localization, serving as a key technique for indoor autonomous navigation of mobile robots. Traditional SLAM methods typically require exhaustive traversal of all rooms during indoor navigation to obtain a complete map, resulting in lengthy path planning times and prolonged time to reach target points. Moreover, cumulative errors during motion lead to inaccurate robot localization, impacting navigation efficiency.This paper proposes an improved SLAM method, FR-SLAM, based on floor plan registration, utilizing a morphology-based floor plan registration algorithm to align and transform original floor plans. This approach facilitates the rapid acquisition of comprehensive motion maps and efficient path planning, enabling swift navigation to target positions within a shorter timeframe. To enhance registration and robot motion localization accuracy, a real-time update strategy is employed, comparing the current position's building structure with the map and dynamically updating floor plan registration results for precise localization. Comparative tests conducted on real and simulated datasets demonstrate that, compared to other benchmark algorithms, this method achieves higher floor plan registration accuracy and shorter time consumption to reach target positions.

CVSep 28, 2022
Deeply Supervised Layer Selective Attention Network: Towards Label-Efficient Learning for Medical Image Classification

Peng Jiang, Juan Liu, Lang Wang et al.

Labeling medical images depends on professional knowledge, making it difficult to acquire large amount of annotated medical images with high quality in a short time. Thus, making good use of limited labeled samples in a small dataset to build a high-performance model is the key to medical image classification problem. In this paper, we propose a deeply supervised Layer Selective Attention Network (LSANet), which comprehensively uses label information in feature-level and prediction-level supervision. For feature-level supervision, in order to better fuse the low-level features and high-level features, we propose a novel visual attention module, Layer Selective Attention (LSA), to focus on the feature selection of different layers. LSA introduces a weight allocation scheme which can dynamically adjust the weighting factor of each auxiliary branch during the whole training process to further enhance deeply supervised learning and ensure its generalization. For prediction-level supervision, we adopt the knowledge synergy strategy to promote hierarchical information interactions among all supervision branches via pairwise knowledge matching. Using the public dataset, MedMNIST, which is a large-scale benchmark for biomedical image classification covering diverse medical specialties, we evaluate LSANet on multiple mainstream CNN architectures and various visual attention modules. The experimental results show the substantial improvements of our proposed method over its corresponding counterparts, demonstrating that LSANet can provide a promising solution for label-efficient learning in the field of medical image classification.

HCApr 5
Exploring a Gamified Personality Assessment Method through Interaction with LLM Agents Embodying Different Personalities

Baiqiao Zhang, Xiangxian Li, Chao Zhou et al.

The low-intrusion and automated personality assessment is receiving increasing attention in psychology and human-computer interaction fields. This study explores an interactive approach for personality assessment, focusing on the multiplicity of personality representation. We propose a framework of Gamified Personality Assessment through Multi-Personality Representations (Multi-PR GPA). The framework leverages Large Language Models to empower virtual agents with different personalities. These agents elicit multifaceted human personality representations through engaging in interactive games. Drawing upon the multi-type textual data generated throughout the interaction, it achieves personality assessments with interpretable insights. Grounded in the classic Big Five personality theory, we developed a prototype system and conducted a user study to evaluate the efficacy of Multi-PR GPA. The results affirm the effectiveness of our approach in personality assessment and demonstrate its superior performance when considering the multiplicity of personality representation. Error structure analysis further revealed systematic assessment biases in LLMs, which multi-context aggregation partially mitigated.

SDSep 5, 2024
Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis

Yucong Zhang, Xin Zou, Jinshan Yang et al.

This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Beyond key video segment extraction from the raw laryngeal videos, MLVAS is able to generate effective audio and visual features for Vocal Fold Paralysis (VFP) detection. Pre-trained audio encoders are utilized to encode the patient voice to get the audio features. Visual features are generated by measuring the angle deviation of both the left and right vocal folds to the estimated glottal midline on the segmented glottis masks. To get better masks, we introduce a diffusion-based refinement that follows traditional U-Net segmentation to reduce false positives. We conducted several ablation studies to demonstrate the effectiveness of each module and modalities in the proposed MLVAS. The experimental results on a public segmentation dataset show the effectiveness of our proposed segmentation module. In addition, unilateral VFP classification results on a real-world clinic dataset demonstrate MLVAS's ability of providing reliable and objective metrics as well as visualization for assisted clinical diagnosis.

ASSep 28, 2025Code
AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines

Cancan Li, Fei Su, Juan Liu et al.

Whisper speech recognition is crucial not only for ensuring privacy in sensitive communications but also for providing a critical communication bridge for patients under vocal restraint and enabling discrete interaction in noise-sensitive environments. The development of Chinese mandarin audio-visual whisper speech recognition is hindered by the lack of large-scale datasets. We present AISHELL6-Whisper, a large-scale open-source audio-visual whisper speech dataset, featuring 30 hours each of whisper speech and parallel normal speech, with synchronized frontal facial videos. Moreover, we propose an audio-visual speech recognition (AVSR) baseline based on the Whisper-Flamingo framework, which integrates a parallel training strategy to align embeddings across speech types, and employs a projection layer to adapt to whisper speech's spectral properties. The model achieves a Character Error Rate (CER) of 4.13% for whisper speech and 1.11% for normal speech in the test set of our dataset, and establishes new state-of-the-art results on the wTIMIT benchmark. The dataset and the AVSR baseline codes are open-sourced at https://zutm.github.io/AISHELL6-Whisper.

SDAug 20, 2025Code
ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signals

Yucong Zhang, Juan Liu, Ming Li

Pre-trained foundation models have demonstrated remarkable success in audio, vision and language, yet their potential for general machine signal modeling with arbitrary sampling rates-covering acoustic, vibration, and other industrial sensor data-remains under-explored. In this work, we propose a novel foundation model ECHO that integrates an advanced band-split architecture with frequency positional embeddings, enabling spectral localization across arbitrary sampling configurations. Moreover, the model incorporates sliding patches to support inputs of variable length without padding or cropping, producing a concise embedding that retains both temporal and spectral fidelity and naturally extends to streaming scenarios. We evaluate our method on various kinds of machine signal datasets, including previous DCASE task 2 challenges (2020-2025), and widely-used industrial signal corpora. Experimental results demonstrate consistent state-of-the-art performance in machine signal anomaly detection and fault classification, confirming the effectiveness and generalization capability of the proposed model. We open-sourced ECHO on https://github.com/yucongzh/ECHO.

LGJan 6, 2025
LOHA: Direct Graph Spectral Contrastive Learning Between Low-pass and High-pass Views

Ziyun Zou, Yinghui Jiang, Lian Shen et al.

Spectral Graph Neural Networks effectively handle graphs with different homophily levels, with low-pass filter mining feature smoothness and high-pass filter capturing differences. When these distinct filters could naturally form two opposite views for self-supervised learning, the commonalities between the counterparts for the same node remain unexplored, leading to suboptimal performance. In this paper, a simple yet effective self-supervised contrastive framework, LOHA, is proposed to address this gap. LOHA optimally leverages low-pass and high-pass views by embracing "harmony in diversity". Rather than solely maximizing the difference between these distinct views, which may lead to feature separation, LOHA harmonizes the diversity by treating the propagation of graph signals from both views as a composite feature. Specifically, a novel high-dimensional feature named spectral signal trend is proposed to serve as the basis for the composite feature, which remains relatively unaffected by changing filters and focuses solely on original feature differences. LOHA achieves an average performance improvement of 2.8% over runner-up models on 9 real-world datasets with varying homophily levels. Notably, LOHA even surpasses fully-supervised models on several datasets, which underscores the potential of LOHA in advancing the efficacy of spectral GNNs for diverse graph structures.

CVJul 17, 2025
SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

Xiangyu Dong, Haoran Zhao, Jiang Gao et al.

Recent advances in vision-language navigation (VLN) were mainly attributed to emerging large language models (LLMs). These methods exhibited excellent generalization capabilities in instruction understanding and task reasoning. However, they were constrained by the fixed knowledge bases and reasoning abilities of LLMs, preventing fully incorporating experiential knowledge and thus resulting in a lack of efficient evolutionary capacity. To address this, we drew inspiration from the evolution capabilities of natural agents, and proposed a self-evolving VLN framework (SE-VLN) to endow VLN agents with the ability to continuously evolve during testing. To the best of our knowledge, it was the first time that an multimodal LLM-powered self-evolving VLN framework was proposed. Specifically, SE-VLN comprised three core modules, i.e., a hierarchical memory module to transfer successful and failure cases into reusable knowledge, a retrieval-augmented thought-based reasoning module to retrieve experience and enable multi-step decision-making, and a reflection module to realize continual evolution. Comprehensive tests illustrated that the SE-VLN achieved navigation success rates of 57% and 35.2% in unseen environments, representing absolute performance improvements of 23.9% and 15.0% over current state-of-the-art methods on R2R and REVERSE datasets, respectively. Moreover, the SE-VLN showed performance improvement with increasing experience repository, elucidating its great potential as a self-evolving agent framework for VLN.

CVAug 21, 2025
LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions

Yongju Jia, Jiarui Ma, Xiangxian Li et al.

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive capability in visual tasks, but their fine-tuning often suffers from bias in class-imbalanced scene. Recent works have introduced large language models (LLMs) to enhance VLM fine-tuning with supplementing semantic information. However, they often overlook inherent class imbalance in VLMs' pre-training, which may lead to bias accumulation in downstream tasks. To address this problem, this paper proposes a Multi-dimensional Dynamic Prompt Routing (MDPR) framework. MDPR constructs a comprehensive knowledge base for classes, spanning five visual-semantic dimensions. During fine-tuning, the dynamic routing mechanism aligns global visual classes, retrieves optimal prompts, and balances fine-grained semantics, yielding stable predictions through logits fusion. Extensive experiments on long-tailed benchmarks, including CIFAR-LT, ImageNet-LT, and Places-LT, demonstrate that MDPR achieves comparable results with current SOTA methods. Ablation studies further confirm the effectiveness of our semantic library for tail classes, and show that our dynamic routing incurs minimal computational overhead, making MDPR a flexible and efficient enhancement for VLM fine-tuning under data imbalance.

CVMay 24, 2025
Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding

Guofeng Mei, Bin Ren, Juan Liu et al.

Vision-language models like CLIP can offer a promising foundation for 3D scene understanding when extended with 3D tokenizers. However, standard approaches, such as k-nearest neighbor or radius-based tokenization, struggle with cross-domain generalization due to sensitivity to dataset-specific spatial scales. We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. We show that combining superpoint-based grouping with coordinate scale normalization consistently outperforms conventional methods through extensive experimental analysis. Specifically, we introduce S4Token, a tokenization pipeline that produces semantically-informed tokens regardless of scene scale. Our tokenizer is trained without annotations using masked point modeling and clustering-based objectives, along with cross-modal distillation to align 3D tokens with 2D multi-view image features. For dense prediction tasks, we propose a superpoint-level feature propagation module to recover point-level detail from sparse tokens.

LGNov 25, 2025
Decoupling and Damping: Structurally-Regularized Gradient Matching for Multimodal Graph Condensation

Lian Shen, Zhendan Chen, Meijia Song et al.

In multimodal graph learning, graph structures that integrate information from multiple sources, such as vision and text, can more comprehensively model complex entity relationships. However, the continuous growth of their data scale poses a significant computational bottleneck for training. Graph condensation methods provide a feasible path forward by synthesizing compact and representative datasets. Nevertheless, existing condensation approaches generally suffer from performance limitations in multimodal scenarios, mainly due to two reasons: (1) semantic misalignment between different modalities leads to gradient conflicts; (2) the message-passing mechanism of graph neural networks further structurally amplifies such gradient noise. Based on this, we propose Structural Regularized Gradient Matching (SR-GM), a condensation framework for multimodal graphs. This method alleviates gradient conflicts between modalities through a gradient decoupling mechanism and introduces a structural damping regularizer to suppress the propagation of gradient noise in the topology, thereby transforming the graph structure from a noise amplifier into a training stabilizer. Extensive experiments on four multimodal graph datasets demonstrate the effectiveness of SR-GM, highlighting its state-of-the-art performance and cross-architecture generalization capabilities in multimodal graph dataset condensation.

SDOct 29, 2025
Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

Jiarong Du, Zhan Jin, Peijun Yang et al.

Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker's speech from mixed audio. In real-world scenarios, there often exist complex acoustic environments, accompanied by various interfering sounds and reverberation. Most previous methods struggle to cope with such complex conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated the performance of our system in AVSEC-4: we achieved excellent results in the three objective metrics on the competition leaderboard, and ultimately secured first place in the human subjective listening test.

LGOct 20, 2025
CrossStateECG: Multi-Scale Deep Convolutional Network with Attention for Rest-Exercise ECG Biometrics

Dan Zheng, Jing Feng, Juan Liu

Current research in Electrocardiogram (ECG) biometrics mainly emphasizes resting-state conditions, leaving the performance decline in rest-exercise scenarios largely unresolved. This paper introduces CrossStateECG, a robust ECG-based authentication model explicitly tailored for cross-state (rest-exercise) conditions. The proposed model creatively combines multi-scale deep convolutional feature extraction with attention mechanisms to ensure strong identification across different physiological states. Experimental results on the exercise-ECGID dataset validate the effectiveness of CrossStateECG, achieving an identification accuracy of 92.50% in the Rest-to-Exercise scenario (training on resting ECG and testing on post-exercise ECG) and 94.72% in the Exercise-to-Rest scenario (training on post-exercise ECG and testing on resting ECG). Furthermore, CrossStateECG demonstrates exceptional performance across both state combinations, reaching an accuracy of 99.94% in Rest-to-Rest scenarios and 97.85% in Mixed-to-Mixed scenarios. Additional validations on the ECG-ID and MIT-BIH datasets further confirmed the generalization abilities of CrossStateECG, underscoring its potential as a practical solution for post-exercise ECG-based authentication in dynamic real-world settings.

ASJun 14, 2024
Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition

Yicong Jiang, Tianzi Wang, Xurong Xie et al.

Disordered speech recognition profound implications for improving the quality of life for individuals afflicted with, for example, dysarthria. Dysarthric speech recognition encounters challenges including limited data, substantial dissimilarities between dysarthric and non-dysarthric speakers, and significant speaker variations stemming from the disorder. This paper introduces Perceiver-Prompt, a method for speaker adaptation that utilizes P-Tuning on the Whisper large-scale model. We first fine-tune Whisper using LoRA and then integrate a trainable Perceiver to generate fixed-length speaker prompts from variable-length inputs, to improve model recognition of Chinese dysarthric speech. Experimental results from our Chinese dysarthric speech dataset demonstrate consistent improvements in recognition performance with Perceiver-Prompt. Relative reduction up to 13.04% in CER is obtained over the fine-tuned Whisper.

CVJun 21, 2021
Improved Padding in CNNs for Quantitative Susceptibility Mapping

Juan Liu

Recently, deep learning methods have been proposed for quantitative susceptibility mapping (QSM) data processing: background field removal, field-to-source inversion, and single-step QSM reconstruction. However, the conventional padding mechanism used in convolutional neural networks (CNNs) can introduce spatial artifacts, especially in QSM background field removal and single-step QSM which requires inference from total fields with extreme large values at the edge boundaries of volume of interest. To address this issue, we propose an improved padding technique which utilizes the neighboring valid voxels to estimate the invalid voxels of feature maps at volume boundaries in the neural networks. Studies using simulated and in-vivo data show that the proposed padding greatly improves estimation accuracy and reduces artifacts in the results in the tasks of background field removal, field-to-source inversion, and single-step QSM reconstruction.

HCApr 12, 2021
A multi-plane augmented reality head-up display system based on volume holographic optical elements with large area

Zhenlv Lv, Juan Liu, Liangfa Xu

The traditional head-up display (HUD) system has the disadvantages of a small area and a single display plane, here we propose and design an augmented reality (AR) HUD system with multi-plane, large area, high diffraction efficiency and a single picture generation unit (PGU) based on holographic optical elements (HOEs). Since volume HOEs have excellent angle selectivity and wavelength selectivity, HOEs of different wavelengths can be designed to display images in different planes. Experimental and simulated results verify the feasibility of this method. Experimental results show that the diffraction efficiencies of the red, green and blue HOEs are 75.2%, 73.1% and 67.5%. And the size of HOEs is 20cm*15cm. Moreover, the three HOEs of red, green and blue display images at different depths of 150cm, 500cm and 1000cm, respectively. In addition, the field of view (FOV) and eye-box (EB) of the system are 12°*10° and 9.5cm*11.2cm. Furthermore, the light transmittance of the system has reached 60%. It is believed that this technique can be applied to the augmented reality navigation display of vehicles and aviation.

IVAug 14, 2020
Weakly-supervised Learning for Single-step Quantitative Susceptibility Mapping

Juan Liu, Kevin M Koch

Quantitative susceptibility mapping (QSM) utilizes MRI phase information to estimate tissue magnetic susceptibility. The generation of QSM requires solving ill-posed background field removal (BFR) and field-to-source inversion problems. Because current QSM techniques struggle to generate reliable QSM in clinical contexts, QSM clinical translation is greatly hindered. Recently, deep learning (DL) approaches for QSM reconstruction have shown impressive performance. Due to inherent non-existent ground-truth, these DL techniques use either calculation of susceptibility through multiple orientation sampling (COSMOS) maps or synthetic data for training, which are constrained by the availability and accuracy of COSMOS maps or domain shift when training data and testing data have different domains. To address these limitations, we propose a weakly-supervised single-step QSM reconstruction method, denoted as wTFI, to directly reconstruct QSM from the total field without BFR. wTFI uses the BFR method RESHARP local fields as supervision to perform a multi-task learning of local tissue fields and QSM, and is capable of recovering magnetic susceptibility estimates near the edges of the brain where are eroded in RESHARP and realize whole brain QSM estimation. Quantitative and qualitative evaluation shows that wTFI can generate high-quality local field and susceptibility maps in a variety of neuroimaging contexts.

ASApr 30, 2020
An Early Study on Intelligent Analysis of Speech under COVID-19: Severity, Sleep Quality, Fatigue, and Anxiety

Jing Han, Kun Qian, Meishu Song et al.

The COVID-19 outbreak was announced as a global pandemic by the World Health Organisation in March 2020 and has affected a growing number of people in the past few weeks. In this context, advanced artificial intelligence techniques are brought to the fore in responding to fight against and reduce the impact of this global health crisis. In this study, we focus on developing some potential use-cases of intelligent speech analysis for COVID-19 diagnosed patients. In particular, by analysing speech recordings from these patients, we construct audio-only-based models to automatically categorise the health state of patients from four aspects, including the severity of illness, sleep quality, fatigue, and anxiety. For this purpose, two established acoustic feature sets and support vector machines are utilised. Our experiments show that an average accuracy of .69 obtained estimating the severity of illness, which is derived from the number of days in hospitalisation. We hope that this study can foster an extremely fast, low-cost, and convenient way to automatically detect the COVID-19 disease.

SDMar 24, 2020
COVID-19 and Computer Audition: An Overview on What Speech & Sound Analysis Could Contribute in the SARS-CoV-2 Corona Crisis

Björn W. Schuller, Dagmar M. Schuller, Kun Qian et al.

At the time of writing, the world population is suffering from more than 10,000 registered COVID-19 disease epidemic induced deaths since the outbreak of the Corona virus more than three months ago now officially known as SARS-CoV-2. Since, tremendous efforts have been made worldwide to counter-steer and control the epidemic by now labelled as pandemic. In this contribution, we provide an overview on the potential for computer audition (CA), i.e., the usage of speech and sound analysis by artificial intelligence to help in this scenario. We first survey which types of related or contextually significant phenomena can be automatically assessed from speech or sound. These include the automatic recognition and monitoring of breathing, dry and wet coughing or sneezing sounds, speech under cold, eating behaviour, sleepiness, or pain to name but a few. Then, we consider potential use-cases for exploitation. These include risk assessment and diagnosis based on symptom histograms and their development over time, as well as monitoring of spread, social distancing and its effects, treatment and recovery, and patient wellbeing. We quickly guide further through challenges that need to be faced for real-life usage. We come to the conclusion that CA appears ready for implementation of (pre-)diagnosis and monitoring tools, and more generally provides rich and significant, yet so far untapped potential in the fight against COVID-19 spread.

MED-PHApr 11, 2019
MRI Tissue Magnetism Quantification through Total Field Inversion with Deep Neural Networks

Juan Liu, Kevin M. Koch

Quantitative susceptibility mapping (QSM) utilizes MRI signal phase to infer estimates of local tissue magnetism (magnetic susceptibility), which has been shown useful to provide novel image contrast and as biomarkers of abnormal tissue. QSM requires addressing a challenging post-processing problem: filtering of image phase estimates and inversion of the phase to susceptibility relationship. A wide variety of quantification errors, robustness limitations, and artifacts plague QSM algorithms. To overcome these limitations, a robust deep-learning-based single-step QSM reconstruction approach is proposed and demonstrated. This neural network was trained using magnetostatic physics simulations based on in-vivo data sources. Random perturbations were added to the physics simulations to provide sufficient input-label pairs for the training purposes. The network was quantitatively tested using gold-standard in-silico labeled datasets against established QSM total field inversion approaches. In addition, the algorithm was applied to susceptibility-weighted imaging (SWI) data collected on a cohort of clinical subjects with brain hemmhorage. When quantitatively compared against gold-standard in-silico labels, the proposed algorithm outperformed the existing comparable approaches. High quality QSM were consistently estimated from clinical susceptibility-weighted data on 100 subjects without any noticeable inversion failures. The proposed approach was able to robustly generate high quality QSM with improved accuracy in in-silico gold-standard experiments. QSM produced by the proposed method can be generated in real-time on existing MRI scanner platforms and provide enhanced visualization and quantification of magnetism-based tissue contrasts.

AIApr 11, 2019
Non-locally Encoder-Decoder Convolutional Network for Whole Brain QSM Inversion

Juan Liu, Kevin M. Koch

Quantitative Susceptibility Mapping (QSM) reconstruction is a challenging inverse problem driven by ill conditioning of its field-to -susceptibility transformation. State-of-art QSM reconstruction methods either suffer from image artifacts or long computation times, which limits QSM clinical translation efforts. To overcome these limitations, a non-locally encoder-decoder gated convolutional neural network is trained to infer whole brain susceptibility map, using the local field and brain mask as the inputs. The performance of the proposed method is evaluated relative to synthetic data, a publicly available challenge dataset, and clinical datasets. The proposed approach can outperform existing methods on quantitative metrics and visual assessment of image sharpness and streaking artifacts. The estimated susceptibility maps can preserve conspicuity of fine features and suppress streaking artifacts. The demonstrated methods have potential value in advancing QSM clinical research and aiding in the translation of QSM to clinical operations.

MLJul 28, 2018
Group-sparse SVD Models and Their Applications in Biological Data

Wenwen Min, Juan Liu, Shihua Zhang

Sparse Singular Value Decomposition (SVD) models have been proposed for biclustering high dimensional gene expression data to identify block patterns with similar expressions. However, these models do not take into account prior group effects upon variable selection. To this end, we first propose group-sparse SVD models with group Lasso (GL1-SVD) and group L0-norm penalty (GL0-SVD) for non-overlapping group structure of variables. However, such group-sparse SVD models limit their applicability in some problems with overlapping structure. Thus, we also propose two group-sparse SVD models with overlapping group Lasso (OGL1-SVD) and overlapping group L0-norm penalty (OGL0-SVD). We first adopt an alternating iterative strategy to solve GL1-SVD based on a block coordinate descent method, and GL0-SVD based on a projection method. The key of solving OGL1-SVD is a proximal operator with overlapping group Lasso penalty. We employ an alternating direction method of multipliers (ADMM) to solve the proximal operator. Similarly, we develop an approximate method to solve OGL0-SVD. Applications of these methods and comparison with competing ones using simulated data demonstrate their effectiveness. Extensive applications of them onto several real gene expression data with gene prior group knowledge identify some biologically interpretable gene modules.

LGOct 13, 2017
Sparse Weighted Canonical Correlation Analysis

Wenwen Min, Juan Liu, Shihua Zhang

Given two data matrices $X$ and $Y$, sparse canonical correlation analysis (SCCA) is to seek two sparse canonical vectors $u$ and $v$ to maximize the correlation between $Xu$ and $Yv$. However, classical and sparse CCA models consider the contribution of all the samples of data matrices and thus cannot identify an underlying specific subset of samples. To this end, we propose a novel sparse weighted canonical correlation analysis (SWCCA), where weights are used for regularizing different samples. We solve the $L_0$-regularized SWCCA ($L_0$-SWCCA) using an alternating iterative algorithm. We apply $L_0$-SWCCA to synthetic data and real-world data to demonstrate its effectiveness and superiority compared to related methods. Lastly, we consider also SWCCA with different penalties like LASSO (Least absolute shrinkage and selection operator) and Group LASSO, and extend it for integrating more than three data matrices.

GNSep 21, 2016
Network-regularized Sparse Logistic Regression Models for Clinical Risk Prediction and Biomarker Discovery

Wenwen Min, Juan Liu, Shihua Zhang

Molecular profiling data (e.g., gene expression) has been used for clinical risk prediction and biomarker discovery. However, it is necessary to integrate other prior knowledge like biological pathways or gene interaction networks to improve the predictive ability and biological interpretability of biomarkers. Here, we first introduce a general regularized Logistic Regression (LR) framework with regularized term $λ\|\bm{w}\|_1 + η\bm{w}^T\bm{M}\bm{w}$, which can reduce to different penalties, including Lasso, elastic net, and network-regularized terms with different $\bm{M}$. This framework can be easily solved in a unified manner by a cyclic coordinate descent algorithm which can avoid inverse matrix operation and accelerate the computing speed. However, if those estimated $\bm{w}_i$ and $\bm{w}_j$ have opposite signs, then the traditional network-regularized penalty may not perform well. To address it, we introduce a novel network-regularized sparse LR model with a new penalty $λ\|\bm{w}\|_1 + η|\bm{w}|^T\bm{M}|\bm{w}|$ to consider the difference between the absolute values of the coefficients. And we develop two efficient algorithms to solve it. Finally, we test our methods and compare them with the related ones using simulated and real data to show their efficiency.

LGMar 19, 2016
L0-norm Sparse Graph-regularized SVD for Biclustering

Wenwen Min, Juan Liu, Shihua Zhang

Learning the "blocking" structure is a central challenge for high dimensional data (e.g., gene expression data). Recently, a sparse singular value decomposition (SVD) has been used as a biclustering tool to achieve this goal. However, this model ignores the structural information between variables (e.g., gene interaction graph). Although typical graph-regularized norm can incorporate such prior graph information to get accurate discovery and better interpretability, it fails to consider the opposite effect of variables with different signs. Motivated by the development of sparse coding and graph-regularized norm, we propose a novel sparse graph-regularized SVD as a powerful biclustering tool for analyzing high-dimensional data. The key of this method is to impose two penalties including a novel graph-regularized norm ($|\pmb{u}|\pmb{L}|\pmb{u}|$) and $L_0$-norm ($\|\pmb{u}\|_0$) on singular vectors to induce structural sparsity and enhance interpretability. We design an efficient Alternating Iterative Sparse Projection (AISP) algorithm to solve it. Finally, we apply our method and related ones to simulated and real data to show its efficiency in capturing natural blocking structures.

CROct 15, 2013
The BeiHang Keystroke Dynamics Authentication System

Juan Liu, Baochang Zhang, Linlin Shen et al.

Keystroke Dynamics is an important biometric solution for person authentication. Based upon keystroke dynamics, this paper designs an embedded password protection device, develops an online system, collects two public databases for promoting the research on keystroke authentication, exploits the Gabor filter bank to characterize keystroke dynamics, and provides benchmark results of three popular classification algorithms, one-class support vector machine, Gaussian classifier, and nearest neighbour classifier.