CVMay 18Code
Unleashing the Representational Power of Fourier Shapes for Attacking Infrared Object DetectionYixing Yong, Jian Wang, Ming Lei et al.
Infrared object detection is crucial for perception in autonomous driving and surveillance but remains vulnerable to physical adversarial attacks. Unlike in the RGB domain, where attacks rely on color texture, infrared attacks must manipulate thermal signatures, making the geometry shape of heat-blocking materials the primary adversarial information carrier. Current shape-based methods suffer from a fundamental trade-off between representational capability and optimization power, limiting their attack effectiveness.In this work, we overcome this dilemma by introducing learnable Fourier shapes to the infrared domain. We utilize an end-to-end differentiable framework where a compact set of Fourier coefficients, defining the shape boundary, is analytically mapped to a pixel-space mask via the winding number theorem. This enables efficient gradient-based optimization to generate potent shapes that cause human targets to evade detection. Extensive digital and physical experiments provide a comprehensive evaluation and validate our superior performance. Our resulting physical patch achieves striking robustness, successfully evading detectors across diverse distances, angles, poses, and individuals, and achieves over 88% attack success rate at distances greater than 25m (conf.=0.5). Code is available at https://github.com/Yongyx99/Fourier-shape-attack.
APMay 28, 2010
Error Analysis of Approximated PCRLBs for Nonlinear DynamicsMing Lei, Pierre Del Moral, Christophe Baehr
In practical nonlinear filtering, the assessment of achievable filtering performance is important. In this paper, we focus on the problem of efficiently approximate the posterior Cramer-Rao lower bound (CRLB) in a recursive manner. By using Gaussian assumptions, two types of approximations for calculating the CRLB are proposed: An exact model using the state estimate as well as a Taylor-series-expanded model using both of the state estimate and its error covariance, are derived. Moreover, the difference between the two approximated CRLBs is also formulated analytically. By employing the particle filter (PF) and the unscented Kalman filter (UKF) to compute, simulation results reveal that the approximated CRLB using mean-covariance-based model outperforms that using the mean-based exact model. It is also shown that the theoretical difference between the estimated CRLBs can be improved through an improved filtering method.
ASApr 20
NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASRYuan Xie, Jiaqi Song, Guang Qiu et al.
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.
LGApr 28
FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge DevicesChangyu Li, Shuanghong Huang, Jiashen Liu et al.
Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data, yet in mobile deployments the training wall-clock is often bottlenecked by straggler-limited uplink communication under heterogeneous bandwidth and intermittent participation. Although parameter-efficient fine-tuning (PEFT) reduces trainable parameters, per-round payloads remain prohibitive in non-IID regimes, where uniform compression can discard rare but task-critical signals. We propose Fed-FSTQ, a Fisher-guided token quantization system primitive for communication-efficient federated LLM fine-tuning. Fed-FSTQ employs a lightweight Fisher proxy to estimate token sensitivity, coupling importance-aware token selection with non-uniform mixed-precision quantization to allocate higher fidelity to informative evidence while suppressing redundant transmission. The method is model-agnostic, serves as a drop-in module for standard federated PEFT pipelines, e.g., LoRA, without modifying the server aggregation rule, and supports bandwidth-heterogeneous clients via compact sparse message packing. Experiments on multilingual QA and medical QA under non-IID partitions show that Fed-FSTQ reduces cumulative uplink traffic required to reach a fixed quality threshold by 46x relative to a standard LoRA baseline, and improves end-to-end wall-clock time-to-accuracy by 52%. Furthermore, enabling Fisher-guided token reduction at inference yields up to a 1.55x end-to-end speedup on NVIDIA Jetson-class edge devices, demonstrating deployability under tight resource constraints.
AIApr 28
PI-TTA: Physics-Informed Source-Free Test-Time Adaptation for Robust Human Activity Recognition on Mobile DevicesChangyu Li, Lu Wang, Ming Lei et al.
Source-free test-time adaptation (TTA) is appealing for mobile and wearable sensing because it enables on-device personalization from unlabeled test streams without centralizing private data. However, sensor-based human activity recognition (HAR) poses challenges that are less pronounced in standard vision benchmarks: behavioral inertial streams are temporally correlated and often exhibit within-session shifts caused by sensor rotation, placement change, and sampling-rate drift. Under this streaming non-i.i.d. setting, widely used vision-style TTA objectives can become unstable, leading to overconfident errors, representation collapse, and catastrophic forgetting. We propose PI-TTA, a lightweight source-free adaptation framework that stabilizes online updates through three physics-consistent constraints: gravity consistency, short-horizon temporal continuity, and spectral stability. PI-TTA updates the same small parameter subset as strong source-free baselines and incurs only modest overhead, making it suitable for on-device deployment. Experiments on USCHAD, PAMAP2, and mHealth under long-sequence stress tests and factorized shift protocols show that PI-TTA mitigates the severe degradation observed in confidence-driven baselines and preserves stable adaptation under sustained streaming conditions. It improves long-sequence accuracy by up to 9.13% and reduces physical-violation rates by 27.5%, 24.1%, and 45.4% on USCHAD, PAMAP2, and mHealth, respectively. These results demonstrate that physics-informed adaptation can improve accuracy, stability, and deployment reliability for real-world mobile sensing systems.
LGApr 2
A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement LearningMing Lei, Christophe Baehr
Reinforcement learning (RL) has become a key approach for enhancing reasoning in large language models (LLMs), yet scalable training is often hindered by the rapid collapse of policy entropy, which leads to premature convergence and performance saturation. This paper provides a comparative theoretical analysis of two entropy control strategies: traditional entropy regularization and the recently proposed covariance-based mechanism. We establish a unified framework for entropy dynamics under softmax parameterization, showing that entropy change is governed by the covariance between log-probabilities and logit updates. Our analysis reveals that traditional entropy regularization introduces a dense, persistent bias that modifies the stationary condition, leading to suboptimal policies, while covariance-based methods selectively regularize a sparse subset of high-covariance tokens and achieve asymptotic unbiasedness when the regularization coefficient is annealed. These results provide principled guidelines for entropy control in LLM posttraining, with implications for scaling RL to larger models and more complex reasoning tasks.
CVApr 27
DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and DiscoveryJiawei Wang, Ming Lei, Yaning Yang et al.
Identifying species in biology among tens of thousands of visually similar taxa while discovering unknown species in open-world environments remains a fundamental challenge in biodiversity research. Current methods treat identification and discovery as separate problems, with classification models assuming closed sets and discovery relying on threshold-based rejection. Here we present DeepTaxon, a retrieval-augmented multimodal framework that unifies species identification and discovery through interpretable reasoning over retrieved visual evidence. Given a query image, DeepTaxon retrieves the top-$k$ candidate species with $n$ exemplar images each from a retrieval index and performs chain-of-thought comparative reasoning. Critically, we redefine discovery as an explicit, retrieval-based decision problem rather than an implicit parametric memory problem. A sample is novel if and only if the retrieval index lacks sufficient evidence for identification, so each retrieval naturally yields a classification or discovery label without manual annotation, thereby providing automatic supervision for both tasks. We train the framework via supervised fine-tuning on synthetic retrieval-augmented data, followed by reinforcement learning on hard samples, converting high-recall retrieval into high-precision decisions that scale to massive taxonomic vocabularies. Extensive experiments on a large-scale in-distribution benchmark and six out-of-distribution datasets demonstrate consistent improvements in both identification and discovery. Ablation studies further reveal effective test-time scaling with candidate count $k$ and exemplar count $n$, strong zero-shot transfer to unseen domains, and consistent performance across retrieval encoders, establishing an interpretable solution for biodiversity research.
ASApr 9
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMsYuan Xie, Jiaqi Song, Guang Qiu et al.
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment. In this study, we revisit LLM-based ASR from an entropy allocation perspective and introduce three metrics to characterize how training paradigms allocate entropy reduction between the speech encoder and the LLM. To remedy entropy-allocation inefficiencies in prevailing approaches, we propose a principled multi-stage training strategy grounded in capability-boundary awareness, optimizing parameter efficiency and hallucination robustness. Specifically, we redesign the pretraining strategy to alleviate the speech-text modality gap, and further introduce an iterative asynchronous SFT stage between alignment and joint SFT to preserve functional decoupling and constrain encoder representation drift. Experiments on Mandarin and English benchmarks show that our method achieves competitive performance with state-of-the-art models using only 2.3B parameters, while also effectively mitigating hallucinations through our decoupling-oriented design.
LGMar 25, 2025
Geometric Meta-Learning via Coupled Ricci Flow: Unifying Knowledge Representation and Quantum EntanglementMing Lei, Christophe Baehr
This paper establishes a unified framework integrating geometric flows with deep learning through three fundamental innovations. First, we propose a thermodynamically coupled Ricci flow that dynamically adapts parameter space geometry to loss landscape topology, formally proved to preserve isometric knowledge embedding (Theorem~\ref{thm:isometric}). Second, we derive explicit phase transition thresholds and critical learning rates (Theorem~\ref{thm:critical}) through curvature blowup analysis, enabling automated singularity resolution via geometric surgery (Lemma~\ref{lem:surgery}). Third, we establish an AdS/CFT-type holographic duality (Theorem~\ref{thm:ads}) between neural networks and conformal field theories, providing entanglement entropy bounds for regularization design. Experiments demonstrate 2.1$\times$ convergence acceleration and 63\% topological simplification while maintaining $\mathcal{O}(N\log N)$ complexity, outperforming Riemannian baselines by 15.2\% in few-shot accuracy. Theoretically, we prove exponential stability (Theorem~\ref{thm:converge}) through a new Lyapunov function combining Perelman entropy with Wasserstein gradient flows, fundamentally advancing geometric deep learning.
LGMar 12
A Geometrically-Grounded Drive for MDL-Based Optimization in Deep LearningMing Lei, Shufan Wu, Christophe Baehr
This paper introduces a novel optimization framework that fundamentally integrates the Minimum Description Length (MDL) principle into the training dynamics of deep neural networks. Moving beyond its conventional role as a model selection criterion, we reformulate MDL as an active, adaptive driving force within the optimization process itself. The core of our method is a geometrically-grounded cognitive manifold whose evolution is governed by a \textit{coupled Ricci flow}, enriched with a novel \textit{MDL Drive} term derived from first principles. This drive, modulated by the task-loss gradient, creates a seamless harmony between data fidelity and model simplification, actively compressing the internal representation during training. We establish a comprehensive theoretical foundation, proving key properties including the monotonic decrease of description length (Theorem~\ref{thm:convergence}), a finite number of topological phase transitions via a geometric surgery protocol (Theorems~\ref{thm:surgery}, \ref{thm:ultimate_fate}), and the emergence of universal critical behavior (Theorem~\ref{thm:universality}). Furthermore, we provide a practical, computationally efficient algorithm with $O(N \log N)$ per-iteration complexity (Theorem~\ref{thm:complexity}), alongside guarantees for numerical stability (Theorem~\ref{thm:stability}) and exponential convergence under convexity assumptions (Theorem~\ref{thm:convergence_rate}). Empirical validation on synthetic regression and classification tasks confirms the theoretical predictions, demonstrating the algorithm's efficacy in achieving robust generalization and autonomous model simplification. This work provides a principled path toward more autonomous, generalizable, and interpretable AI systems by unifying geometric deep learning with information-theoretic principles.
LGMar 12
HCP-DCNet: A Hierarchical Causal Primitive Dynamic Composition Network for Self-Improving Causal UnderstandingMing Lei, Shufan Wu, Christophe Baehr
The ability to understand and reason about cause and effect -- encompassing interventions, counterfactuals, and underlying mechanisms -- is a cornerstone of robust artificial intelligence. While deep learning excels at pattern recognition, it fundamentally lacks a model of causality, making systems brittle under distribution shifts and unable to answer ``what-if'' questions. This paper introduces the \emph{Hierarchical Causal Primitive Dynamic Composition Network (HCP-DCNet)}, a unified framework that bridges continuous physical dynamics with discrete symbolic causal inference. Departing from monolithic representations, HCP-DCNet decomposes causal scenes into reusable, typed \emph{causal primitives} organized into four abstraction layers: physical, functional, event, and rule. A dual-channel routing network dynamically composes these primitives into task-specific, fully differentiable \emph{Causal Execution Graphs (CEGs)}. Crucially, the system employs a \emph{causal-intervention-driven meta-evolution} strategy, enabling autonomous self-improvement through a constrained Markov decision process. We establish rigorous theoretical guarantees, including type-safe composition, routing convergence, and universal approximation of causal dynamics. Extensive experiments across simulated physical and social environments demonstrate that HCP-DCNet significantly outperforms state-of-the-art baselines in causal discovery, counterfactual reasoning, and compositional generalization. This work provides a principled, scalable, and interpretable architecture for building AI systems with human-like causal abstraction and continual self-refinement capabilities.
AIOct 27, 2025
Multi-Agent Conditional Diffusion Model with Mean Field Communication as Wireless Resource Allocation PlannerKechen Meng, Sinuo Zhang, Rongpeng Li et al.
In wireless communication systems, efficient and adaptive resource allocation plays a crucial role in enhancing overall Quality of Service (QoS). While centralized Multi-Agent Reinforcement Learning (MARL) frameworks rely on a central coordinator for policy training and resource scheduling, they suffer from scalability issues and privacy risks. In contrast, the Distributed Training with Decentralized Execution (DTDE) paradigm enables distributed learning and decision-making, but it struggles with non-stationarity and limited inter-agent cooperation, which can severely degrade system performance. To overcome these challenges, we propose the Multi-Agent Conditional Diffusion Model Planner (MA-CDMP) for decentralized communication resource management. Built upon the Model-Based Reinforcement Learning (MBRL) paradigm, MA-CDMP employs Diffusion Models (DMs) to capture environment dynamics and plan future trajectories, while an inverse dynamics model guides action generation, thereby alleviating the sample inefficiency and slow convergence of conventional DTDE methods. Moreover, to approximate large-scale agent interactions, a Mean-Field (MF) mechanism is introduced as an assistance to the classifier in DMs. This design mitigates inter-agent non-stationarity and enhances cooperation with minimal communication overhead in distributed settings. We further theoretically establish an upper bound on the distributional approximation error introduced by the MF-based diffusion generation, guaranteeing convergence stability and reliable modeling of multi-agent stochastic dynamics. Extensive experiments demonstrate that MA-CDMP consistently outperforms existing MARL baselines in terms of average reward and QoS metrics, showcasing its scalability and practicality for real-world wireless network optimization.
ASFeb 16, 2022
ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-SpeechYi Ren, Ming Lei, Zhiying Huang et al.
Expressive text-to-speech (TTS) has become a hot research topic recently, mainly focusing on modeling prosody in speech. Prosody modeling has several challenges: 1) the extracted pitch used in previous prosody modeling works have inevitable errors, which hurts the prosody modeling; 2) different attributes of prosody (e.g., pitch, duration and energy) are dependent on each other and produce the natural prosody together; and 3) due to high variability of prosody and the limited amount of high-quality data for TTS training, the distribution of prosody cannot be fully shaped. To tackle these issues, we propose ProsoSpeech, which enhances the prosody using quantized latent vectors pre-trained on large-scale unpaired and low-quality text and speech data. Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes in the latent prosody vector (LPV). Then we introduce an LPV predictor, which predicts LPV given word sequence. We pre-train the LPV predictor on large-scale text and low-quality speech data and fine-tune it on the high-quality TTS dataset. Finally, our model can generate expressive speech conditioned on the predicted LPV. Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
SDNov 28, 2021
Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual InformationZhihao Du, Shiliang Zhang, Siqi Zheng et al.
Overlapping speech diarization is always treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding the multi-speaker labels with power set. Specifically, we propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels according to the similarities between speech features and given speaker embeddings. Our method is further extended and integrated with downstream tasks by utilizing the textual information, which has not been well studied in previous literature. The experimental results show that our method achieves lower diarization error rate than the target-speaker voice activity detection. When textual information is involved, the diarization errors can be further reduced. For the real meeting scenario, our method can achieve 34.11% relative improvement compared with the Bayesian hidden Markov model based clustering algorithm.
ASOct 14, 2021
FedSpeech: Federated Text-to-Speech with Continual LearningZiyue Jiang, Yi Ren, Ming Lei et al.
Federated learning enables collaborative training of machine learning models under strict privacy restrictions and federated text-to-speech aims to synthesize natural speech of multiple users with a few audio training samples stored in their devices locally. However, federated text-to-speech faces several challenges: very few training samples from each speaker are available, training samples are all stored in local device of each user, and global model is vulnerable to various attacks. In this paper, we propose a novel federated learning architecture based on continual learning approaches to overcome the difficulties above. Specifically, 1) we use gradual pruning masks to isolate parameters for preserving speakers' tones; 2) we apply selective masks for effectively reusing knowledge from tasks; 3) a private speaker embedding is introduced to keep users' privacy. Experiments on a reduced VCTK dataset demonstrate the effectiveness of FedSpeech: it nearly matches multi-task training in terms of multi-speaker speech quality; moreover, it sufficiently retains the speakers' tones and even outperforms the multi-task training in the speaker similarity experiment.
SDSep 9, 2021
BeamTransformer: Microphone Array-based Overlapping Speech DetectionSiqi Zheng, Shiliang Zhang, Weilong Huang et al.
We propose BeamTransformer, an efficient architecture to leverage beamformer's edge in spatial filtering and transformer's capability in context sequence modeling. BeamTransformer seeks to optimize modeling of sequential relationship among signals from different spatial direction. Overlapping speech detection is one of the tasks where such optimization is favorable. In this paper we effectively apply BeamTransformer to detect overlapping segments. Comparing to single-channel approach, BeamTransformer exceeds in learning to identify the relationship among different beam sequences and hence able to make predictions not only from the acoustic signals but also the localization of the source. The results indicate that a successful incorporation of microphone array signals can lead to remarkable gains. Moreover, BeamTransformer takes one step further, as speech from overlapped speakers have been internally separated into different beams.
CLJun 17, 2021
EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech ModelChenye Cui, Yi Ren, Jinglin Liu et al.
Recently, there has been an increasing interest in neural speech synthesis. While the deep neural network achieves the state-of-the-art result in text-to-speech (TTS) tasks, how to generate a more emotional and more expressive speech is becoming a new challenge to researchers due to the scarcity of high-quality emotion speech dataset and the lack of advanced emotional TTS model. In this paper, we first briefly introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and its emotion human-labeled annotation. After that, we propose a simple but efficient architecture for emotional speech synthesis called EMSpeech. Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding. In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations. Finally, by showing a comparable performance in the emotional speech synthesis task, we successfully demonstrate the ability of the proposed model.
SDApr 6, 2021
Extremely Low Footprint End-to-End ASR System for Smart DeviceZhifu Gao, Yiwu Yao, Shiliang Zhang et al.
Recently, end-to-end (E2E) speech recognition has become popular, since it can integrate the acoustic, pronunciation and language models into a single neural network, which outperforms conventional models. Among E2E approaches, attention-based models, e.g. Transformer, have emerged as being superior. Such models have opened the door to deployment of ASR on smart devices, however they still suffer from requiring a large number of model parameters. We propose an extremely low footprint E2E ASR system for smart devices, to achieve the goal of satisfying resource constraints without sacrificing recognition accuracy. We design cross-layer weight sharing to improve parameter efficiency and further exploit model compression methods including sparsification and quantization, to reduce memory storage and boost decoding efficiency. We evaluate our approaches on the public AISHELL-1 and AISHELL-2 benchmarks. On the AISHELL-2 task, the proposed method achieves more than 10x compression (model size reduces from 248 to 24MB), at the cost of only minor performance loss (CER reduces from 6.49% to 6.92%).
ASOct 29, 2020
DeviceTTS: A Small-Footprint, Fast, Stable Network for On-Device Text-to-SpeechZhiying Huang, Hao Li, Ming Lei
With the number of smart devices increasing, the demand for on-device text-to-speech (TTS) increases rapidly. In recent years, many prominent End-to-End TTS methods have been proposed, and have greatly improved the quality of synthesized speech. However, to ensure the qualified speech, most TTS systems depend on large and complex neural network models, and it's hard to deploy these TTS systems on-device. In this paper, a small-footprint, fast, stable network for on-device TTS is proposed, named as DeviceTTS. DeviceTTS makes use of a duration predictor as a bridge between encoder and decoder so as to avoid the problem of words skipping and repeating in Tacotron. As we all know, model size is a key factor for on-device TTS. For DeviceTTS, Deep Feedforward Sequential Memory Network (DFSMN) is used as the basic component. Moreover, to speed up inference, mix-resolution decoder is proposed for balance the inference speed and speech quality. Experiences are done with WORLD and LPCNet vocoder. Finally, with only 1.4 million model parameters and 0.099 GFLOPS, DeviceTTS achieves comparable performance with Tacotron and FastSpeech. As far as we know, the DeviceTTS can meet the needs of most of the devices in practical application.
SDOct 27, 2020
Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder ModelZhifu Gao, Shiliang Zhang, Ming Lei et al.
Recently, online end-to-end ASR has gained increasing attention. However, the performance of online systems still lags far behind that of offline systems, with a large gap in quality of recognition. For specific scenarios, we can trade-off between performance and latency, and can train multiple systems with different delays to match the performance and latency requirements of various application scenarios. In this work, in contrast to trading-off between performance and latency, we envisage a single system that can match the needs of different scenarios. We propose a novel architecture, termed Universal ASR that can unify streaming and non-streaming ASR models into one system. The embedded streaming ASR model can configure different delays according to requirements to obtain real-time recognition results, while the non-streaming model is able to refresh the final recognition result for better performance. We have evaluated our approach on the public AISHELL-2 benchmark and an industrial-level 20,000-hour Mandarin speech recognition task. The experimental results show that the Universal ASR provides an efficient mechanism to integrate streaming and non-streaming models that can recognize speech quickly and accurately. On the AISHELL-2 task, Universal ASR comfortably outperforms other state-of-the-art systems.
SPJun 11, 2020
A PDD Decoder for Binary Linear Codes With Neural Check Polytope ProjectionYi Wei, Ming-Min Zhao, Min-Jian Zhao et al.
Linear Programming (LP) is an important decoding technique for binary linear codes. However, the advantages of LP decoding, such as low error floor and strong theoretical guarantee, etc., come at the cost of high computational complexity and poor performance at the low signal-to-noise ratio (SNR) region. In this letter, we adopt the penalty dual decomposition (PDD) framework and propose a PDD algorithm to address the fundamental polytope based maximum likelihood (ML) decoding problem. Furthermore, we propose to integrate machine learning techniques into the most time-consuming part of the PDD decoding algorithm, i.e., check polytope projection (CPP). Inspired by the fact that a multi-layer perception (MLP) can theoretically approximate any nonlinear mapping function, we present a specially designed neural CPP (NCPP) algorithm to decrease the decoding latency. Simulation results demonstrate the effectiveness of the proposed algorithms.
SDMay 21, 2020
Simplified Self-Attention for Transformer-based End-to-End Speech RecognitionHaoneng Luo, Shiliang Zhang, Ming Lei et al.
Transformer models have been introduced into end-to-end speech recognition with state-of-the-art performance on various tasks owing to their superiority in modeling long-term dependencies. However, such improvements are usually obtained through the use of very large neural networks. Transformer models mainly include two submodules - position-wise feedforward layers and self-attention (SAN) layers. In this paper, to reduce the model complexity while maintaining good performance, we propose a simplified self-attention (SSAN) layer which employs FSMN memory block instead of projection layers to form query and key vectors for transformer-based end-to-end speech recognition. We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1, internal 1000-hour and 20,000-hour large-scale Mandarin tasks. Results show that our proposed SSAN-based transformer model can achieve over 20% relative reduction in model parameters and 6.7% relative CER reduction on the AISHELL-1 task. With impressively 20% parameter reduction, our model shows no loss of recognition performance on the 20,000-hour large-scale task.
SDMay 21, 2020
Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech RecognitionShiliang Zhang, Zhifu Gao, Haoneng Luo et al.
Recently, streaming end-to-end automatic speech recognition (E2E-ASR) has gained more and more attention. Many efforts have been paid to turn the non-streaming attention-based E2E-ASR system into streaming architecture. In this work, we propose a novel online E2E-ASR system by using Streaming Chunk-Aware Multihead Attention(SCAMA) and a latency control memory equipped self-attention network (LC-SAN-M). LC-SAN-M uses chunk-level input to control the latency of encoder. As to SCAMA, a jointly trained predictor is used to control the output of encoder when feeding to decoder, which enables decoder to generate output in streaming manner. Experimental results on the open 170-hour AISHELL-1 and an industrial-level 20000-hour Mandarin speech recognition tasks show that our approach can significantly outperform the MoChA-based baseline system under comparable setup. On the AISHELL-1 task, our proposed method achieves a character error rate (CER) of 7.39%, to the best of our knowledge, which is the best published performance for online ASR.
SDMay 21, 2020
SAN-M: Memory Equipped Self-Attention for End-to-End Speech RecognitionZhifu Gao, Shiliang Zhang, Ming Lei et al.
End-to-end speech recognition has become popular in recent years, since it can integrate the acoustic, pronunciation and language models into a single neural network. Among end-to-end approaches, attention-based methods have emerged as being superior. For example, Transformer, which adopts an encoder-decoder architecture. The key improvement introduced by Transformer is the utilization of self-attention instead of recurrent mechanisms, enabling both encoder and decoder to capture long-range dependencies with lower computational complexity.In this work, we propose boosting the self-attention ability with a DFSMN memory block, forming the proposed memory equipped self-attention (SAN-M) mechanism. Theoretical and empirical comparisons have been made to demonstrate the relevancy and complementarity between self-attention and the DFSMN memory block. Furthermore, the proposed SAN-M provides an efficient mechanism to integrate these two modules. We have evaluated our approach on the public AISHELL-1 benchmark and an industrial-level 20,000-hour Mandarin speech recognition task. On both tasks, SAN-M systems achieved much better performance than the self-attention based Transformer baseline system. Specially, it can achieve a CER of 6.46% on the AISHELL-1 task even without using any external LM, comfortably outperforming other state-of-the-art systems.
ITFeb 14, 2020
ADMM-based Decoder for Binary Linear Codes Aided by Deep LearningYi Wei, Ming-Min Zhao, Min-Jian Zhao et al.
Inspired by the recent advances in deep learning (DL), this work presents a deep neural network aided decoding algorithm for binary linear codes. Based on the concept of deep unfolding, we design a decoding network by unfolding the alternating direction method of multipliers (ADMM)-penalized decoder. In addition, we propose two improved versions of the proposed network. The first one transforms the penalty parameter into a set of iteration-dependent ones, and the second one adopts a specially designed penalty function, which is based on a piecewise linear function with adjustable slopes. Numerical results show that the resulting DL-aided decoders outperform the original ADMM-penalized decoder for various low density parity check (LDPC) codes with similar computational complexity.
SPJun 10, 2019
Learned Conjugate Gradient Descent Network for Massive MIMO DetectionYi Wei, Ming-Min Zhao, Mingyi Hong et al.
In this work, we consider the use of model-driven deep learning techniques for massive multiple-input multiple-output (MIMO) detection. Compared with conventional MIMO systems, massive MIMO promises improved spectral efficiency, coverage and range. Unfortunately, these benefits are coming at the cost of significantly increased computational complexity. To reduce the complexity of signal detection and guarantee the performance, we present a learned conjugate gradient descent network (LcgNet), which is constructed by unfolding the iterative conjugate gradient descent (CG) detector. In the proposed network, instead of calculating the exact values of the scalar step-sizes, we explicitly learn their universal values. Also, we can enhance the proposed network by augmenting the dimensions of these step-sizes. Furthermore, in order to reduce the memory costs, a novel quantized LcgNet is proposed, where a low-resolution nonuniform quantizer is integrated into the LcgNet to smartly quantize the aforementioned step-sizes. The quantizer is based on a specially designed soft staircase function with learnable parameters to adjust its shape. Meanwhile, due to fact that the number of learnable parameters is limited, the proposed networks are easy and fast to train. Numerical results demonstrate that the proposed network can achieve promising performance with much lower complexity.
ASMar 27, 2019
Automatic Spelling Correction with Transformer for CTC-based End-to-End Speech RecognitionShiliang Zhang, Ming Lei, Zhijie Yan
Connectionist Temporal Classification (CTC) based end-to-end speech recognition system usually need to incorporate an external language model by using WFST-based decoding in order to achieve promising results. This is more essential to Mandarin speech recognition since it owns a special phenomenon, namely homophone, which causes a lot of substitution errors. The linguistic information introduced by language model will help to distinguish these substitution errors. In this work, we propose a transformer based spelling correction model to automatically correct errors especially the substitution errors made by CTC-based Mandarin speech recognition system. Specifically, we investigate using the recognition results generated by CTC-based systems as input and the ground-truth transcriptions as output to train a transformer with encoder-decoder architecture, which is much similar to machine translation. Results in a 20,000 hours Mandarin speech recognition task show that the proposed spelling correction model can achieve a CER of 3.41%, which results in 22.9% and 53.2% relative improvement compared to the baseline CTC-based systems decoded with and without language model respectively.
ASMar 5, 2018
Linear networks based speaker adaptation for speech synthesisZhiying Huang, Heng Lu, Ming Lei et al.
Speaker adaptation methods aim to create fair quality synthesis speech voice font for target speakers while only limited resources available. Recently, as deep neural networks based statistical parametric speech synthesis (SPSS) methods become dominant in SPSS TTS back-end modeling, speaker adaptation under the neural network based SPSS framework has also became an important task. In this paper, linear networks (LN) is inserted in multiple neural network layers and fine-tuned together with output layer for best speaker adaptation performance. When adaptation data is extremely small, the low-rank plus diagonal(LRPD) decomposition for LN is employed to make the adapted voice more stable. Speaker adaptation experiments are conducted under a range of adaptation utterances numbers. Moreover, speaker adaptation from 1) female to female, 2) male to female and 3) female to male are investigated. Objective measurement and subjective tests show that LN with LRPD decomposition performs most stable when adaptation data is extremely limited, and our best speaker adaptation (SA) model with only 200 adaptation utterances achieves comparable quality with speaker dependent (SD) model trained with 1000 utterances, in both naturalness and similarity to target speaker.
NEMar 4, 2018
Deep-FSMN for Large Vocabulary Continuous Speech RecognitionShiliang Zhang, Ming Lei, Zhijie Yan et al.
In this paper, we present an improved feedforward sequential memory networks (FSMN) architecture, namely Deep-FSMN (DFSMN), by introducing skip connections between memory blocks in adjacent layers. These skip connections enable the information flow across different layers and thus alleviate the gradient vanishing problem when building very deep structure. As a result, DFSMN significantly benefits from these skip connections and deep structure. We have compared the performance of DFSMN to BLSTM both with and without lower frame rate (LFR) on several large speech recognition tasks, including English and Mandarin. Experimental results shown that DFSMN can consistently outperform BLSTM with dramatic gain, especially trained with LFR using CD-Phone as modeling units. In the 2000 hours Fisher (FSH) task, the proposed DFSMN can achieve a word error rate of 9.4% by purely using the cross-entropy criterion and decoding with a 3-gram language model, which achieves a 1.5% absolute improvement compared to the BLSTM. In a 20000 hours Mandarin recognition task, the LFR trained DFSMN can achieve more than 20% relative improvement compared to the LFR trained BLSTM. Moreover, we can easily design the lookahead filter order of the memory blocks in DFSMN to control the latency for real-time applications.
CLFeb 26, 2018
Deep Feed-forward Sequential Memory Networks for Speech SynthesisMengxiao Bi, Heng Lu, Shiliang Zhang et al.
The Bidirectional LSTM (BLSTM) RNN based speech synthesis system is among the best parametric Text-to-Speech (TTS) systems in terms of the naturalness of generated speech, especially the naturalness in prosody. However, the model complexity and inference cost of BLSTM prevents its usage in many runtime applications. Meanwhile, Deep Feed-forward Sequential Memory Networks (DFSMN) has shown its consistent out-performance over BLSTM in both word error rate (WER) and the runtime computation cost in speech recognition tasks. Since speech synthesis also requires to model long-term dependencies compared to speech recognition, in this paper, we investigate the Deep-FSMN (DFSMN) in speech synthesis. Both objective and subjective experiments show that, compared with BLSTM TTS method, the DFSMN system can generate synthesized speech with comparable speech quality while drastically reduce model complexity and speech generation time.
CVJun 4, 2017
Data preprocessing methods for robust Fourier ptychographic microscopyYan Zhang, An Pan, Ming Lei et al.
Fourier ptychographic microscopy (FPM) is a recently proposed computational imaging technique with both high resolution and wide field-of-view. In current FP experimental setup, the dark-field images with high-angle illuminations are easily submerged by stray light and background noise due to the low signal-to-noise ratio, thus significantly degrading the reconstruction quality and also imposing a major restriction on the synthetic numerical aperture (NA) of the FP approach. To this end, an overall and systematic data preprocessing scheme for noise removal from FP's raw dataset is provided, which involves sampling analysis as well as underexposed/overexposed treatments, then followed by the elimination of unknown stray light and suppression of inevitable background noise, especially Gaussian noise and CCD dark current in our experiments. The reported non-parametric scheme facilitates great enhancements of the FP's performance, which has been demonstrated experimentally that the benefits of noise removal by these methods far outweigh its defects of concomitant signal loss. In addition, it could be flexibly cooperated with the existing state-of-the-art algorithms, producing a stronger robustness of the FP approach in various applications.