AS SD Dec 22, 2019

End-Point Detection with State Transition Model based on Chunk-Wise Classification

arXiv:1912.10442v11.2

Originality: Incremental advance

AI Analysis

This work addresses robust end-point detection for speech processing in noisy conditions. It is an incremental advance: it builds on existing voice activity detection and changes how the frame-level decisions are aggregated.

The paper tackles end-point detection errors in noisy environments by proposing a state transition model based on chunk-wise classification, which reduced phone error rates compared to frame-level methods.

A state transition model (STM) based on chunk-wise classification is proposed for end-point detection (EPD). In general, EPD is built from frame-wise voice activity detection (VAD) with an additional STM, in which state transitions are driven by the VAD's frame-level decisions (speech or non-speech). However, VAD errors occur frequently in noisy environments, even with a state-of-the-art deep neural network based VAD, and these errors cause undesired state transitions in the STM. In this work, to build a robust STM, state transitions are instead driven by chunk-wise classification, since EPD does not need to operate at the frame level. A chunk consists of multiple frames, and it is classified as speech or non-speech by aggregating the VAD decisions for those frames, so that isolated VAD errors within a chunk are smoothed out by the other, correct decisions. Finally, the model is evaluated with both qualitative and quantitative measures, including phone error rate.
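The chunk-wise idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how frame decisions are aggregated or what states the STM has, so the majority-vote rule, the `threshold` and `chunk_size` values, the two state names, and the `end_silence_chunks` parameter are all assumptions chosen for illustration.

```python
from typing import List, Optional


def classify_chunks(frame_decisions: List[int],
                    chunk_size: int = 10,
                    threshold: float = 0.5) -> List[int]:
    """Aggregate frame-level VAD decisions (1 = speech, 0 = non-speech)
    into one label per non-overlapping chunk.

    Assumption: a chunk is labeled speech when the fraction of speech
    frames reaches `threshold` (a simple majority-style vote).
    Trailing frames that do not fill a whole chunk are ignored.
    """
    labels = []
    for start in range(0, len(frame_decisions) - chunk_size + 1, chunk_size):
        chunk = frame_decisions[start:start + chunk_size]
        speech_ratio = sum(chunk) / chunk_size
        labels.append(1 if speech_ratio >= threshold else 0)
    return labels


def detect_end_point(chunk_labels: List[int],
                     end_silence_chunks: int = 3) -> Optional[int]:
    """Hypothetical two-state STM driven by chunk labels.

    WAIT_SPEECH -> IN_SPEECH on the first speech chunk; the end point is
    declared after `end_silence_chunks` consecutive non-speech chunks.
    Returns the index of the chunk where the end point is declared,
    or None if no end point is found.
    """
    state = "WAIT_SPEECH"
    silence_run = 0
    for i, label in enumerate(chunk_labels):
        if state == "WAIT_SPEECH":
            if label == 1:
                state = "IN_SPEECH"
        else:  # state == "IN_SPEECH"
            if label == 0:
                silence_run += 1
                if silence_run >= end_silence_chunks:
                    return i
            else:
                silence_run = 0  # speech resumed, restart the count
    return None


# Toy frame-level VAD output with a few isolated errors (made-up data).
frames = (
    [0] * 10                            # leading non-speech
    + [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]    # speech with two VAD misses
    + [1] * 10                          # clean speech
    + [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]    # non-speech with one false alarm
    + [0] * 20                          # trailing non-speech
)
chunks = classify_chunks(frames)
end_chunk = detect_end_point(chunks)
```

In this toy input, the two missed speech frames and the single false alarm do not change their chunk labels, so a state machine driven by chunk labels never sees them. A frame-level state machine would see each of those errors as a potential transition.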

