Nan Zhao

CV
h-index29
15papers
1,092citations
Novelty40%
AI Score51

15 Papers

SDJul 4, 2024Code
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Keyu An, Qian Chen, Chong Deng et al.

This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.

CVJul 8, 2022
Abs-CAM: A Gradient Optimization Interpretable Approach for Explanation of Convolutional Neural Networks

Chunyan Zeng, Kang Yan, Zhifeng Wang et al.

The black-box nature of Deep Neural Networks (DNNs) severely hinders its performance improvement and application in specific scenes. In recent years, class activation mapping-based method has been widely used to interpret the internal decisions of models in computer vision tasks. However, when this method uses backpropagation to obtain gradients, it will cause noise in the saliency map, and even locate features that are irrelevant to decisions. In this paper, we propose an Absolute value Class Activation Mapping-based (Abs-CAM) method, which optimizes the gradients derived from the backpropagation and turns all of them into positive gradients to enhance the visual features of output neurons' activation, and improve the localization ability of the saliency map. The framework of Abs-CAM is divided into two phases: generating initial saliency map and generating final saliency map. The first phase improves the localization ability of the saliency map by optimizing the gradient, and the second phase linearly combines the initial saliency map with the original image to enhance the semantic information of the saliency map. We conduct qualitative and quantitative evaluation of the proposed method, including Deletion, Insertion, and Pointing Game. The experimental results show that the Abs-CAM can obviously eliminate the noise in the saliency map, and can better locate the features related to decisions, and is superior to the previous methods in recognition and localization tasks.

IRMay 26
Uniboost: Global Coordination with Value Alignment for Fair and Efficient Traffic Allocation

Ge Fan, Nan Zhao, Kai Meng et al.

With the rapid evolution of internet services, recommendation systems have become indispensable. In particular, the blending (re-ranking) stage plays a pivotal role in allocating traffic across diverse business objectives. However, existing approaches often suffer from coupled allocation plans, score inflation, and a lack of interpretability. To address these challenges, we propose Uniboost, a unified traffic allocation framework. Uniboost introduces a posterior value alignment mechanism that calibrates abstract model scores to anchor metrics with explicit business semantics, significantly enhancing interpretability. Furthermore, it employs an independent linear boosting paradigm to decouple complex weighting schemes, enabling precise attribution of each plan's contribution. We validate the effectiveness of Uniboost through online A/B tests and in-depth data analysis, demonstrating three key findings: 1) Reducing the overall weight of weighted scores effectively mitigates unintended business interference, yielding a more efficient micro-level traffic allocation strategy; 2) Post-hoc analyses and aggregated dashboards provide intuitive, macro-level insights that guide the design of the overall traffic allocation mechanism; 3) The proposed "Effective Completion Score" serves as an easily obtainable post-metric that offers a reliable anchor for content recommendation pipelines. Collectively, our experiments show that Uniboost not only improves traffic allocation efficiency and recommendation performance at the micro level but also provides macro-level guidance for system iteration. Thus, this work provides an efficient and controllable traffic regulation solution for large-scale industrial recommendation systems.

SDAug 25, 2022
Spatio-Temporal Representation Learning Enhanced Source Cell-phone Recognition from Speech Recordings

Chunyan Zeng, Shixiong Feng, Zhifeng Wang et al.

The existing source cell-phone recognition method lacks the long-term feature characterization of the source device, resulting in inaccurate representation of the source cell-phone related features which leads to insufficient recognition accuracy. In this paper, we propose a source cell-phone recognition method based on spatio-temporal representation learning, which includes two main parts: extraction of sequential Gaussian mean matrix features and construction of a recognition model based on spatio-temporal representation learning. In the feature extraction part, based on the analysis of time-series representation of recording source signals, we extract sequential Gaussian mean matrix with long-term and short-term representation ability by using the sensitivity of Gaussian mixture model to data distribution. In the model construction part, we design a structured spatio-temporal representation learning network C3D-BiLSTM to fully characterize the spatio-temporal information, combine 3D convolutional network and bidirectional long short-term memory network for short-term spectral information and long-time fluctuation information representation learning, and achieve accurate recognition of cell-phones by fusing spatio-temporal feature information of recording source signals. The method achieves an average accuracy of 99.03% for the closed-set recognition of 45 cell-phones under the CCNU\_Mobile dataset, and 98.18% in small sample size experiments, with recognition performance better than the existing state-of-the-art methods. The experimental results show that the method exhibits excellent recognition performance in multi-class cell-phones recognition.

CVNov 11, 2022
JSRNN: Joint Sampling and Reconstruction Neural Networks for High Quality Image Compressed Sensing

Chunyan Zeng, Jiaxiang Ye, Zhifeng Wang et al.

Most Deep Learning (DL) based Compressed Sensing (DCS) algorithms adopt a single neural network for signal reconstruction, and fail to jointly consider the influences of the sampling operation for reconstruction. In this paper, we propose unified framework, which jointly considers the sampling and reconstruction process for image compressive sensing based on well-designed cascade neural networks. Two sub-networks, which are the sampling sub-network and the reconstruction sub-network, are included in the proposed framework. In the sampling sub-network, an adaptive full connected layer instead of the traditional random matrix is used to mimic the sampling operator. In the reconstruction sub-network, a cascade network combining stacked denoising autoencoder (SDA) and convolutional neural network (CNN) is designed to reconstruct signals. The SDA is used to solve the signal mapping problem and the signals are initially reconstructed. Furthermore, CNN is used to fully recover the structure and texture features of the image to obtain better reconstruction performance. Extensive experiments show that this framework outperforms many other state-of-the-art methods, especially at low sampling rates.

HCMar 12
AI Chatbots or Human Therapists? Belief-Based Predictors of Mental Health Help-Seeking Intentions in the Age of Generative AI

Junsang Park, Sarah Brown, David L. Vogel et al.

As generative artificial intelligence (GAI) enters the mental health landscape, questions arise about how individuals weigh AI tools against human therapists. This study examined belief-based predictors of intention to use GAI and therapists across two populations: a university sample (N = 1,155) and a nationally representative adult sample (N = 651). Using paired-sample t-tests following a MANOVA, we found that human therapists were viewed as providing greater emotional support and coping, relationship, and educational skills as well as being able to personalize treatment than GAI chatbots. In turn, GAI support was viewed as being more affordable and accessible. No differences between modalities were found with concerns about privacy, reliability, stigma, mental health literacy or help-seeking norms. Using LASSO regression, we examined how beliefs about each modality jointly shape help-seeking intentions. Across both samples, intentions to use either GAI or human therapists were most strongly associated with perceptions of interpersonal support, including emotional support, relational guidance, and personalization. Barriers differed across modalities: concerns about privacy and reliability were more strongly associated with reduced intention to use GAI, whereas structural constraints, particularly affordability, were more closely linked to human therapy use. These findings extend the Health Belief Model to a dual-modality context, demonstrating that help-seeking decisions reflect a comparative push-pull process in which barriers to one modality redirect users toward the other. Design implications are discussed for developing trustworthy, emotionally resonant GAI tools that complement rather than replace human care.

CLJan 10, 2025
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Qian Chen, Yafeng Chen, Yanni Chen et al.

Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo supports controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.

ITDec 17, 2024
Distributed satellite information networks: Architecture, enabling technologies, and trends

Qinyu Zhang, Liang Xu, Jianhao Huang et al.

Driven by the vision of ubiquitous connectivity and wireless intelligence, the evolution of ultra-dense constellation-based satellite-integrated Internet is underway, now taking preliminary shape. Nevertheless, the entrenched institutional silos and limited, nonrenewable heterogeneous network resources leave current satellite systems struggling to accommodate the escalating demands of next-generation intelligent applications. In this context, the distributed satellite information networks (DSIN), exemplified by the cohesive clustered satellites system, have emerged as an innovative architecture, bridging information gaps across diverse satellite systems, such as communication, navigation, and remote sensing, and establishing a unified, open information network paradigm to support resilient space information services. This survey first provides a profound discussion about innovative network architectures of DSIN, encompassing distributed regenerative satellite network architecture, distributed satellite computing network architecture, and reconfigurable satellite formation flying, to enable flexible and scalable communication, computing and control. The DSIN faces challenges from network heterogeneity, unpredictable channel dynamics, sparse resources, and decentralized collaboration frameworks. To address these issues, a series of enabling technologies is identified, including channel modeling and estimation, cloud-native distributed MIMO cooperation, grant-free massive access, network routing, and the proper combination of all these diversity techniques. Furthermore, to heighten the overall resource efficiency, the cross-layer optimization techniques are further developed to meet upper-layer deterministic, adaptive and secure information services requirements. In addition, emerging research directions and new opportunities are highlighted on the way to achieving the DSIN vision.

CVJan 25
Domain-Expert-Guided Hybrid Mixture-of-Experts for Medical AI: Integrating Data-Driven Learning with Clinical Priors

Jinchen Gu, Nan Zhao, Lei Qiu et al.

Mixture-of-Experts (MoE) models increase representational capacity with modest computational cost, but their effectiveness in specialized domains such as medicine is limited by small datasets. In contrast, clinical practice offers rich expert knowledge, such as physician gaze patterns and diagnostic heuristics, that models cannot reliably learn from limited data. Combining data-driven experts, which capture novel patterns, with domain-expert-guided experts, which encode accumulated clinical insights, provides complementary strengths for robust and clinically meaningful learning. To this end, we propose Domain-Knowledge-Guided Hybrid MoE (DKGH-MoE), a plug-and-play and interpretable module that unifies data-driven learning with domain expertise. DKGH-MoE integrates a data-driven MoE to extract novel features from raw imaging data, and a domain-expert-guided MoE incorporates clinical priors, specifically clinician eye-gaze cues, to emphasize regions of high diagnostic relevance. By integrating domain expert insights with data-driven features, DKGH-MoE improves both performance and interpretability.

CVSep 1, 2025
A Stroke-Level Large-Scale Database of Chinese Character Handwriting and the OpenHandWrite_Toolbox for Handwriting Research

Zebo Xu, Shaoyun Yu, Mark Torrance et al.

Understanding what linguistic components (e.g., phonological, semantic, and orthographic systems) modulate Chinese handwriting at the character, radical, and stroke levels remains an important yet understudied topic. Additionally, there is a lack of comprehensive tools for capturing and batch-processing fine-grained handwriting data. To address these issues, we constructed a large-scale handwriting database in which 42 Chinese speakers for each handwriting 1200 characters in a handwriting-to-dictation task. Additionally, we enhanced the existing handwriting package and provided comprehensive documentation for the upgraded OpenHandWrite_Toolbox, which can easily modify the experimental design, capture the stroke-level handwriting trajectory, and batch-process handwriting measurements (e.g., latency, duration, and pen-pressure). In analysing our large-scale database, multiple regression results show that orthographic predictors impact handwriting preparation and execution across character, radical, and stroke levels. Phonological factors also influence execution at all three levels. Importantly, these lexical effects demonstrate hierarchical attenuation - they were most pronounced at the character level, followed by the radical, and were weakest at the stroke levels. These findings demonstrate that handwriting preparation and execution at the radical and stroke levels are closely intertwined with linguistic components. This database and toolbox offer valuable resources for future psycholinguistic and neurolinguistic research on the handwriting of characters and sub-characters across different languages.

CLSep 27, 2021
The JDDC 2.0 Corpus: A Large-Scale Multimodal Multi-Turn Chinese Dialogue Dataset for E-commerce Customer Service

Nan Zhao, Haoran Li, Youzheng Wu et al.

With the development of the Internet, more and more people get accustomed to online shopping. When communicating with customer service, users may express their requirements by means of text, images, and videos, which precipitates the need for understanding these multimodal information for automatic customer service systems. Images usually act as discriminators for product models, or indicators of product failures, which play important roles in the E-commerce scenario. On the other hand, detailed information provided by the images is limited, and typically, customer service systems cannot understand the intents of users without the input text. Thus, bridging the gap of the image and text is crucial for the multimodal dialogue task. To handle this problem, we construct JDDC 2.0, a large-scale multimodal multi-turn dialogue dataset collected from a mainstream Chinese E-commerce platform (JD.com), containing about 246 thousand dialogue sessions, 3 million utterances, and 507 thousand images, along with product knowledge bases and image category annotations. We present the solutions of top-5 teams participating in the JDDC multimodal dialogue challenge based on this dataset, which provides valuable insights for further researches on the multimodal dialogue task.

LGNov 1, 2019
Robust Federated Learning with Noisy Communication

Fan Ang, Li Chen, Nan Zhao et al.

Federated learning is a communication-efficient training process that alternates between local training at the edge devices and averaging the updated local model at the central server. Nevertheless, it is impractical to achieve a perfect acquisition of the local models in wireless communication due to noise, which also brings serious effects on federated learning. To tackle this challenge, we propose a robust design for federated learning to alleviate the effects of noise in this paper. Considering noise in the two aforementioned steps, we first formulate the training problem as a parallel optimization for each node under the expectation-based model and the worst-case model. Due to the non-convexity of the problem, a regularization for the loss function approximation method is proposed to make it tractable. Regarding the worst-case model, we develop a feasible training scheme which utilizes the sampling-based successive convex approximation algorithm to tackle the unavailable maxima or minima noise condition and the non-convex issue of the objective function. Furthermore, the convergence rates of both new designs are analyzed from a theoretical point of view. Finally, the improvement of prediction accuracy and the reduction of loss function are demonstrated via simulations for the proposed designs.

CVOct 8, 2019
Sky pixel detection in outdoor imagery using an adaptive algorithm and machine learning

Kerry A. Nice, Jasper S. Wijnands, Ariane Middel et al.

Computer vision techniques enable automated detection of sky pixels in outdoor imagery. In urban climate, sky detection is an important first step in gathering information about urban morphology and sky view factors. However, obtaining accurate results remains challenging and becomes even more complex using imagery captured under a variety of lighting and weather conditions. To address this problem, we present a new sky pixel detection system demonstrated to produce accurate results using a wide range of outdoor imagery types. Images are processed using a selection of mean-shift segmentation, K-means clustering, and Sobel filters to mark sky pixels in the scene. The algorithm for a specific image is chosen by a convolutional neural network, trained with 25,000 images from the Skyfinder data set, reaching 82% accuracy for the top three classes. This selection step allows the sky marking to follow an adaptive process and to use different techniques and parameters to best suit a particular image. An evaluation of fourteen different techniques and parameter sets shows that no single technique can perform with high accuracy across varied Skyfinder and Google Street View data sets. However, by using our adaptive process, large increases in accuracy are observed. The resulting system is shown to perform better than other published techniques.

CVMay 23, 2018
RGB-T Object Tracking:Benchmark and Baseline

Chenglong Li, Xinyan Liang, Yijuan Lu et al.

RGB-Thermal (RGB-T) object tracking receives more and more attention due to the strongly complementary benefits of thermal information to visible data. However, RGB-T research is limited by lacking a comprehensive evaluation platform. In this paper, we propose a large-scale video benchmark dataset for RGB-T tracking.It has three major advantages over existing ones: 1) Its size is sufficiently large for large-scale performance evaluation (total frame number: 234K, maximum frame per sequence: 8K). 2) The alignment between RGB-T sequence pairs is highly accurate, which does not need pre- or post-processing. 3) The occlusion levels are annotated for occlusion-sensitive performance analysis of different tracking algorithms.Moreover, we propose a novel graph-based approach to learn a robust object representation for RGB-T tracking. In particular, the tracked object is represented with a graph with image patches as nodes. This graph including graph structure, node weights and edge weights is dynamically learned in a unified ADMM (alternating direction method of multipliers)-based optimization framework, in which the modality weights are also incorporated for adaptive fusion of multiple source data.Extensive experiments on the large-scale dataset are executed to demonstrate the effectiveness of the proposed tracker against other state-of-the-art tracking methods. We also provide new insights and potential research directions to the field of RGB-T object tracking.

CYAug 4, 2015
Recognition of Emotions using Kinects

Shun Li, Changye Zhu, Liqing Cui et al.

Psychological studies indicate that emotional states are expressed in the way people walk and the human gait is investigated in terms of its ability to reveal a person's emotional state. And Microsoft Kinect is a rapidly developing, inexpensive, portable and no-marker motion capture system. This paper gives a new referable method to do emotion recognition, by using Microsoft Kinect to do gait pattern analysis, which has not been reported. $59$ subjects are recruited in this study and their gait patterns are record by two Kinect cameras. Significant joints selecting, Coordinate system transforming, Slider window gauss filter, Differential operation, and Data segmentation are used in data preprocessing. Feature extracting is based on Fourier transformation. By using the NaiveBayes, RandomForests, libSVM and SMO classification, the recognition rate of natural and unnatural emotions can reach above 70%.It is concluded that using the Kinect system can be a new method in recognition of emotions.