Kexin Zhao

CL
h-index1
7papers
2,183citations
Novelty56%
AI Score47

7 Papers

LGOct 25, 2017Code
Trace norm regularization and faster inference for embedded speech recognition RNNs

Markus Kliegl, Siddharth Goyal, Kexin Zhao et al.

We propose and evaluate new techniques for compressing and speeding up dense matrix multiplications as found in the fully connected and recurrent layers of neural networks for embedded large vocabulary continuous speech recognition (LVCSR). For compression, we introduce and study a trace norm regularization technique for training low rank factored versions of matrix multiplications. Compared to standard low rank training, we show that our method leads to good accuracy versus number of parameter trade-offs and can be used to speed up training of large models. For speedup, we enable faster inference on ARM processors through new open sourced kernels optimized for small batch sizes, resulting in 3x to 7x speed ups over the widely used gemmlowp library. Beyond LVCSR, we expect our techniques and kernels to be more generally applicable to embedded neural networks with large fully connected or recurrent layers.

CLNov 20, 2025
Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation

Kexin Zhao, Ken Forbus

Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.

LGAug 4, 2025
Multi-Treatment-DML: Causal Estimation for Multi-Dimensional Continuous Treatments with Monotonicity Constraints in Personal Loan Risk Optimization

Kexin Zhao, Bo Wang, Cuiying Zhao et al.

Optimizing credit limits, interest rates, and loan terms is crucial for managing borrower risk and lifetime value (LTV) in personal loan platform. However, counterfactual estimation of these continuous, multi-dimensional treatments faces significant challenges: randomized trials are often prohibited by risk controls and long repayment cycles, forcing reliance on biased observational data. Existing causal methods primarily handle binary/discrete treatments and struggle with continuous, multi-dimensional settings. Furthermore, financial domain knowledge mandates provably monotonic treatment-outcome relationships (e.g., risk increases with credit limit).To address these gaps, we propose Multi-Treatment-DML, a novel framework leveraging Double Machine Learning (DML) to: (i) debias observational data for causal effect estimation; (ii) handle arbitrary-dimensional continuous treatments; and (iii) enforce monotonic constraints between treatments and outcomes, guaranteeing adherence to domain requirements.Extensive experiments on public benchmarks and real-world industrial datasets demonstrate the effectiveness of our approach. Furthermore, online A/B testing conducted on a realworld personal loan platform, confirms the practical superiority of Multi-Treatment-DML in real-world loan operations.

ASSep 21, 2020
DiffWave: A Versatile Diffusion Model for Audio Synthesis

Zhifeng Kong, Wei Ping, Jiaji Huang et al.

In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.

SDDec 3, 2019
WaveFlow: A Compact Flow-based Model for Raw Audio

Wei Ping, Kainan Peng, Kexin Zhao et al.

In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. It handles the long-range structure of 1-D waveform with a dilated 2-D convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases. It generates high-fidelity speech as WaveNet, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps. Furthermore, it can significantly reduce the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has only 5.91M parameters, which is 15$\times$ smaller than WaveGlow. It can generate 22.05 kHz high-fidelity audio 42.6$\times$ faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.

CLJul 9, 2019
Multi-Speaker End-to-End Speech Synthesis

Jihyun Park, Kexin Zhao, Kainan Peng et al.

In this work, we extend ClariNet (Ping et al., 2019), a fully end-to-end speech synthesis model (i.e., text-to-wave), to generate high-fidelity speech from multiple speakers. To model the unique characteristic of different voices, low dimensional trainable speaker embeddings are shared across each component of ClariNet and trained together with the rest of the model. We demonstrate that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.

CLMay 21, 2019
Non-Autoregressive Neural Text-to-Speech

Kainan Peng, Wei Ping, Zhao Song et al.

In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and brings 46.7 times speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on the challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build the parallel text-to-speech system and test various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet as previous work.