Shan Cao

CV
h-index11
17papers
417citations
Novelty51%
AI Score40

17 Papers

CVAug 12, 2023
TongueSAM: An Universal Tongue Segmentation Model Based on SAM with Zero-Shot

Shan Cao, Qunsheng Ruan, Linjian Ma

Tongue segmentation serves as the primary step in automated TCM tongue diagnosis, which plays a significant role in the diagnostic results. Currently, numerous deep learning based methods have achieved promising results. However, when confronted with tongue images that differ from the training set or possess challenging backgrounds, these methods demonstrate limited performance. To address this issue, this paper proposes a universal tongue segmentation model named TongueSAM based on SAM (Segment Anything Model). SAM is a large-scale pretrained interactive segmentation model known for its powerful zero-shot generalization capability. Applying SAM to tongue segmentation leverages its learned prior knowledge from natural images, enabling the achievement of zero-shot segmentation for various types of tongue images. In this study, a Prompt Generator based on object detection is integrated into SAM to enable an end-to-end automated tongue segmentation method. Experiments demonstrate that TongueSAM achieves exceptional performance across various of tongue segmentation datasets, particularly under zero-shot. Even when dealing with challenging background tongue images, TongueSAM achieves a mIoU of 95.23\% under zero-shot conditions, surpassing other segmentation methods. As far as we know, this is the first application of large-scale pretrained model for tongue segmentation. The project mentioned in this paper is currently publicly available.

IVSep 6, 2023
A Non-Invasive Interpretable NAFLD Diagnostic Method Combining TCM Tongue Features

Shan Cao, Qunsheng Ruan, Qingfeng Wu et al.

Non-alcoholic fatty liver disease (NAFLD) is a clinicopathological syndrome characterized by hepatic steatosis resulting from the exclusion of alcohol and other identifiable liver-damaging factors. It has emerged as a leading cause of chronic liver disease worldwide. Currently, the conventional methods for NAFLD detection are expensive and not suitable for users to perform daily diagnostics. To address this issue, this study proposes a non-invasive and interpretable NAFLD diagnostic method, the required user-provided indicators are only Gender, Age, Height, Weight, Waist Circumference, Hip Circumference, and tongue image. This method involves merging patients' physiological indicators with tongue features, which are then input into a fusion network named SelectorNet. SelectorNet combines attention mechanisms with feature selection mechanisms, enabling it to autonomously learn the ability to select important features. The experimental results show that the proposed method achieves an accuracy of 77.22\% using only non-invasive data, and it also provides compelling interpretability matrices. This study contributes to the early diagnosis of NAFLD and the intelligent advancement of TCM tongue diagnosis. The project mentioned in this paper is currently publicly available.

ARSep 6, 2025
Unlimited Vector Processing for Wireless Baseband Based on RISC-V Extension

Limin Jiang, Yi Shi, Yihao Shen et al.

Wireless baseband processing (WBP) serves as an ideal scenario for utilizing vector processing, which excels in managing data-parallel operations due to its parallel structure. However, conventional vector architectures face certain constraints such as limited vector register sizes, reliance on power-of-two vector length multipliers, and vector permutation capabilities tied to specific architectures. To address these challenges, we have introduced an instruction set extension (ISE) based on RISC-V known as unlimited vector processing (UVP). This extension enhances both the flexibility and efficiency of vector computations. UVP employs a novel programming model that supports non-power-of-two register groupings and hardware strip-mining, thus enabling smooth handling of vectors of varying lengths while reducing the software strip-mining burden. Vector instructions are categorized into symmetric and asymmetric classes, complemented by specialized load/store strategies to optimize execution. Moreover, we present a hardware implementation of UVP featuring sophisticated hazard detection mechanisms, optimized pipelines for symmetric tasks such as fixed-point multiplication and division, and a robust permutation engine for effective asymmetric operations. Comprehensive evaluations demonstrate that UVP significantly enhances performance, achieving up to 3.0$\times$ and 2.1$\times$ speedups in matrix multiplication and fast Fourier transform (FFT) tasks, respectively, when measured against lane-based vector architectures. Our synthesized RTL for a 16-lane configuration using SMIC 40nm technology spans 0.94 mm$^2$ and achieves an area efficiency of 21.2 GOPS/mm$^2$.

AIJul 22, 2025
SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting

Shuhao Mei, Yongchao Long, Shan Cao et al.

Chronic Obstructive Pulmonary Disease (COPD), a major chronic respiratory disease with persistent airflow limitation, is a leading global cause of disability and mortality. Respiratory spirogram time series, routinely collected during pulmonary function tests (PFTs), play a critical role in the early detection of repsiratory diseases and in monitoring lung function over time. However, most current AI models for COPD diagnosis are limited to outputting classification results without providing a rationale for their diagnostic process, while current Large Language Models (LLMs) cannot understand spirograms yet, which severely limits their clinical trust and adoption. To tackle this challenge, we leverage a cohort of 234,028 individuals from the UK Biobank (UKB) to propose SpiroLLM, the first multimodal large language model that can understand spirogram. The model extracts morphological features from respiratory curves via a SpiroEncoder and aligns them with PFT numerical values in a unified latent space using a SpiroProjector, ultimately empowering a large language model to generate a comprehensive diagnostic report. Experimental results confirm that SpiroLLM achieved a diagnostic AUROC of 0.8980 (95% CI: 0.8820-0.9132). In a robustness test with missing core data, it maintained a 100% valid response rate, far surpassing the 13.4% of a text-only model and showcasing the superiority of its multimodal design. This work demonstrates the substantial potential of deeply fusing physiological signals with large language models, establishing a new paradigm for the next generation of interpretable and reliable clinical decision support tools.

LGMay 6, 2024
Deep Learning for Detecting and Early Predicting Chronic Obstructive Pulmonary Disease from Spirogram Time Series

Shuhao Mei, Xin Li, Yuxi Zhou et al.

Chronic Obstructive Pulmonary Disease (COPD) is a chronic lung condition characterized by airflow obstruction. Current diagnostic methods primarily rely on identifying prominent features in spirometry (Volume-Flow time series) to detect COPD, but they are not adept at predicting future COPD risk based on subtle data patterns. In this study, we introduce a novel deep learning-based approach, DeepSpiro, aimed at the early prediction of future COPD risk. DeepSpiro consists of four key components: SpiroSmoother for stabilizing the Volume-Flow curve, SpiroEncoder for capturing volume variability-pattern through key patches of varying lengths, SpiroExplainer for integrating heterogeneous data and explaining predictions through volume attention, and SpiroPredictor for predicting the disease risk of undiagnosed high-risk patients based on key patch concavity, with prediction horizons of 1, 2, 3, 4, 5 years, or even longer. Evaluated on the UK Biobank dataset, DeepSpiro achieved an AUC of 0.8328 for COPD detection and demonstrated strong predictive performance for future COPD risk (p-value < 0.001). In summary, DeepSpiro can effectively predicts the long-term progression of the COPD disease.

CVOct 29, 2021
Multi-Task and Multi-Modal Learning for RGB Dynamic Gesture Recognition

Dinghao Fan, Hengjie Lu, Shugong Xu et al.

Gesture recognition is getting more and more popular due to various application possibilities in human-machine interaction. Existing multi-modal gesture recognition systems take multi-modal data as input to improve accuracy, but such methods require more modality sensors, which will greatly limit their application scenarios. Therefore we propose an end-to-end multi-task learning framework in training 2D convolutional neural networks. The framework can use the depth modality to improve accuracy during training and save costs by using only RGB modality during inference. Our framework is trained to learn a representation for multi-task learning: gesture segmentation and gesture recognition. Depth modality contains the prior information for the location of the gesture. Therefore it can be used as the supervision for gesture segmentation. A plug-and-play module named Multi-Scale-Decoder is designed to realize gesture segmentation, which contains two sub-decoder. It is used in the lower stage and higher stage respectively, and can help the network pay attention to key target areas, ignore irrelevant information, and extract more discriminant features. Additionally, the MSD module and depth modality are only used in the training stage to improve gesture recognition performance. Only RGB modality and network without MSD are required during inference. Experimental results on three public gesture recognition datasets show that our proposed method provides superior performance compared with existing gesture recognition frameworks. Moreover, using the proposed plug-and-play MSD in other 2D CNN-based frameworks also get an excellent accuracy improvement.

CVAug 13, 2021
IFR: Iterative Fusion Based Recognizer For Low Quality Scene Text Recognition

Zhiwei Jia, Shugong Xu, Shiyi Mu et al.

Although recent works based on deep learning have made progress in improving recognition accuracy on scene text recognition, how to handle low-quality text images in end-to-end deep networks remains a research challenge. In this paper, we propose an Iterative Fusion based Recognizer (IFR) for low quality scene text recognition, taking advantage of refined text images input and robust feature representation. IFR contains two branches which focus on scene text recognition and low quality scene text image recovery respectively. We utilize an iterative collaboration between two branches, which can effectively alleviate the impact of low quality input. A feature fusion module is proposed to strengthen the feature representation of the two branches, where the features from the Recognizer are Fused with image Restoration branch, referred to as RRF. Without changing the recognition network structure, extensive quantitative and qualitative experimental results show that the proposed method significantly outperforms the baseline methods in boosting the recognition accuracy of benchmark datasets and low resolution images in TextZoom dataset.

CVJun 24, 2021
SGTBN: Generating Dense Depth Maps from Single-Line LiDAR

Hengjie Lu, Shugong Xu, Shan Cao

Depth completion aims to generate a dense depth map from the sparse depth map and aligned RGB image. However, current depth completion methods use extremely expensive 64-line LiDAR(about $100,000) to obtain sparse depth maps, which will limit their application scenarios. Compared with the 64-line LiDAR, the single-line LiDAR is much less expensive and much more robust. Therefore, we propose a method to tackle the problem of single-line depth completion, in which we aim to generate a dense depth map from the single-line LiDAR info and the aligned RGB image. A single-line depth completion dataset is proposed based on the existing 64-line depth completion dataset(KITTI). A network called Semantic Guided Two-Branch Network(SGTBN) which contains global and local branches to extract and fuse global and local info is proposed for this task. A Semantic guided depth upsampling module is used in our network to make full use of the semantic info in RGB images. Except for the usual MSE loss, we add the virtual normal loss to increase the constraint of high-order 3D geometry in our network. Our network outperforms the state-of-the-art in the single-line depth completion task. Besides, compared with the monocular depth estimation, our method also has significant advantages in precision and model size.

CVApr 1, 2021
Arbitrary-Shaped Text Detection withAdaptive Text Region Representation

Xiufeng Jiang, Shugong Xu, Shunqing Zhang et al.

Text detection/localization, as an important task in computer vision, has witnessed substantialadvancements in methodology and performance with convolutional neural networks. However, the vastmajority of popular methods use rectangles or quadrangles to describe text regions. These representationshave inherent drawbacks, especially relating to dense adjacent text and loose regional text boundaries,which usually cause difficulty detecting arbitrarily shaped text. In this paper, we propose a novel text regionrepresentation method, with a robust pipeline, which can precisely detect dense adjacent text instances witharbitrary shapes. We consider a text instance to be composed of an adaptive central text region mask anda corresponding expanding ratio between the central text region and the full text region. More specifically,our pipeline generates adaptive central text regions and corresponding expanding ratios with a proposedtraining strategy, followed by a new proposed post-processing algorithm which expands central text regionsto the complete text instance with the corresponding expanding ratios. We demonstrated that our new textregion representation is effective, and that the pipeline can precisely detect closely adjacent text instances ofarbitrary shapes. Experimental results on common datasets demonstrate superior performance o

CVMar 29, 2021
Tracking Based Semi-Automatic Annotation for Scene Text Videos

Jiajun Zhu, Xiufeng Jiang, Zhiwei Jia et al.

Recently, video scene text detection has received increasing attention due to its comprehensive applications. However, the lack of annotated scene text video datasets has become one of the most important problems, which hinders the development of video scene text detection. The existing scene text video datasets are not large-scale due to the expensive cost caused by manual labeling. In addition, the text instances in these datasets are too clear to be a challenge. To address the above issues, we propose a tracking based semi-automatic labeling strategy for scene text videos in this paper. We get semi-automatic scene text annotation by labeling manually for the first frame and tracking automatically for the subsequent frames, which avoid the huge cost of manual labeling. Moreover, a paired low-quality scene text video dataset named Text-RBL is proposed, consisting of raw videos, blurry videos, and low-resolution videos, labeled by the proposed convenient semi-automatic labeling strategy. Through an averaging operation and bicubic down-sampling operation over the raw videos, we can efficiently obtain blurry videos and low-resolution videos paired with raw videos separately. To verify the effectiveness of Text-RBL, we propose a baseline model combined with the text detector and tracker for video scene text detection. Moreover, a failure detection scheme is designed to alleviate the baseline model drift issue caused by complex scenes. Extensive experiments demonstrate that Text-RBL with paired low-quality videos labeled by the semi-automatic method can significantly improve the performance of the text detector in low-quality scenes.

CVMar 29, 2021
A Dataset and Benchmark Towards Multi-Modal Face Anti-Spoofing Under Surveillance Scenarios

Xudong Chen, Shugong Xu, Qiaobin Ji et al.

Face Anti-spoofing (FAS) is a challenging problem due to complex serving scenarios and diverse face presentation attack patterns. Especially when captured images are low-resolution, blurry, and coming from different domains, the performance of FAS will degrade significantly. The existing multi-modal FAS datasets rarely pay attention to the cross-domain problems under deployment scenarios, which is not conducive to the study of model performance. To solve these problems, we explore the fine-grained differences between multi-modal cameras and construct a cross-domain multi-modal FAS dataset under surveillance scenarios called GREAT-FASD-S. Besides, we propose an Attention based Face Anti-spoofing network with Feature Augment (AFA) to solve the FAS towards low-quality face images. It consists of the depthwise separable attention module (DAM) and the multi-modal based feature augment module (MFAM). Our model can achieve state-of-the-art performance on the CASIA-SURF dataset and our proposed GREAT-FASD-S dataset.

SDJul 12, 2020
Learning Frame Level Attention for Environmental Sound Classification

Zhichao Zhang, Shugong Xu, Shunqing Zhang et al.

Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. The classification performance is heavily dependent on the effectiveness of representative features extracted from the environmental sounds. However, ESC often suffers from the semantically irrelevant frames and silent frames. In order to deal with this, we employ a frame-level attention model to focus on the semantically relevant frames and salient frames. Specifically, we first propose a convolutional recurrent neural network to learn spectro-temporal features and temporal correlations. Then, we extend our convolutional RNN model with a frame-level attention mechanism to learn discriminative feature representations for ESC. We investigated the classification performance when using different attention scaling function and applying different layers. Experiments were conducted on ESC-50 and ESC-10 datasets. Experimental results demonstrated the effectiveness of the proposed method and our method achieved the state-of-the-art or competitive classification accuracy with lower computational complexity. We also visualized our attention results and observed that the proposed attention mechanism was able to lead the network tofocus on the semantically relevant parts of environmental sounds.

CVDec 5, 2019
E2-Capsule Neural Networks for Facial Expression Recognition Using AU-Aware Attention

Shan Cao, Yuqian Yao, Gaoyun An

Capsule neural network is a new and popular technique in deep learning. However, the traditional capsule neural network does not extract features sufficiently before the dynamic routing between the capsules. In this paper, the one Double Enhanced Capsule Neural Network (E2-Capsnet) that uses AU-aware attention for facial expression recognition (FER) is proposed. The E2-Capsnet takes advantage of dynamic routing between the capsules, and has two enhancement modules which are beneficial for FER. The first enhancement module is the convolutional neural network with AU-aware attention, which can help focus on the active areas of the expression. The second enhancement module is the capsule neural network with multiple convolutional layers, which enhances the ability of the feature representation. Finally, squashing function is used to classify the facial expression. We demonstrate the effectiveness of E2-Capsnet on the two public benchmark datasets, RAF-DB and EmotioNet. The experimental results show that our E2-Capsnet is superior to the state-of-the-art methods. Our implementation will be publicly available online.

SDAug 16, 2019
Sub-Spectrogram Segmentation for Environmental Sound Classification via Convolutional Recurrent Neural Network and Score Level Fusion

Tianhao Qiao, Shunqing Zhang, Zhichao Zhang et al.

Environmental Sound Classification (ESC) is an important and challenging problem, and feature representation is a critical and even decisive factor in ESC. Feature representation ability directly affects the accuracy of sound classification. Therefore, the ESC performance is heavily dependent on the effectiveness of representative features extracted from the environmental sounds. In this paper, we propose a subspectrogram segmentation based ESC classification framework. In addition, we adopt the proposed Convolutional Recurrent Neural Network (CRNN) and score level fusion to jointly improve the classification accuracy. Extensive truncation schemes are evaluated to find the optimal number and the corresponding band ranges of sub-spectrograms. Based on the numerical experiments, the proposed framework can achieve 81.9% ESC classification accuracy on the public dataset ESC-50, which provides 9.1% accuracy improvement over traditional baseline schemes.

SDJul 4, 2019
Attention based Convolutional Recurrent Neural Network for Environmental Sound Classification

Zhichao Zhang, Shugong Xu, Tianhao Qiao et al.

Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. The ESC performance is heavily dependent on the effectiveness of representative features extracted from the environmental sounds. However, ESC often suffers from the semantically irrelevant frames and silent frames. In order to deal with this, we employ a frame-level attention model to focus on the semantically relevant frames and salient frames. Specifically, we first propose an convolutional recurrent neural network to learn spectro-temporal features and temporal correlations. Then, we extend our convolutional RNN model with a frame-level attention mechanism to learn discriminative feature representations for ESC. Experiments were conducted on ESC-50 and ESC-10 datasets. Experimental results demonstrated the effectiveness of the proposed method and achieved the state-of-the-art performance in terms of classification accuracy.

NIApr 9, 2019
Passive TCP Identification for Wired and WirelessNetworks: A Long-Short Term Memory Approach

Xiaoyu Chen, Shugong Xu, Xudong Chen et al.

Transmission control protocol (TCP) congestion control is one of the key techniques to improve network performance. TCP congestion control algorithm identification (TCP identification) can be used to significantly improve network efficiency. Existing TCP identification methods can only be applied to limited number of TCP congestion control algorithms and focus on wired networks. In this paper, we proposed a machine learning based passive TCP identification method for wired and wireless networks. After comparing among three typical machine learning models, we concluded that the 4-layers Long Short Term Memory (LSTM) model achieves the best identification accuracy. Our approach achieves better than 98% accuracy in wired and wireless networks and works for newly proposed TCP congestion control algorithms.

SDAug 25, 2018
Deep Convolutional Neural Network with Mixup for Environmental Sound Classification

Zhichao Zhang, Shugong Xu, Shan Cao et al.

Environmental sound classification (ESC) is an important and challenging problem. In contrast to speech, sound events have noise-like nature and may be produced by a wide variety of sources. In this paper, we propose to use a novel deep convolutional neural network for ESC tasks. Our network architecture uses stacked convolutional and pooling layers to extract high-level feature representations from spectrogram-like features. Furthermore, we apply mixup to ESC tasks and explore its impacts on classification performance and feature distribution. Experiments were conducted on UrbanSound8K, ESC-50 and ESC-10 datasets. Our experimental results demonstrated that our ESC system has achieved the state-of-the-art performance (83.7%) on UrbanSound8K and competitive performance on ESC-50 and ESC-10.