SD LG AS Mar 29, 2024

Voice Signal Processing for Machine Learning. The Case of Speaker Isolation

arXiv:2403.20202v1

Originality: Synthesis-oriented

AI Analysis

The paper addresses the problem of simplifying voice recognition tasks for ML engineers by offering accessible guidance on signal processing techniques. Its contribution is incremental, since it reviews existing methods.

This paper provides a comparative analysis of Fourier and Wavelet transforms for audio signal preprocessing in machine learning tasks, aiming to help ML engineers choose and evaluate decomposition methods effectively without deep signal processing expertise.

The widespread use of automated voice assistants, along with other recent technological developments, has increased the demand for applications that process audio signals and human voice in particular. Voice recognition tasks are typically performed using artificial intelligence and machine learning models. Even though end-to-end models exist, properly pre-processing the signal can greatly reduce the complexity of the task and allow it to be solved with a simpler ML model and fewer computational resources. However, ML engineers who work on such tasks might not have a background in signal processing, which is an entirely different area of expertise. The objective of this work is to provide a concise comparative analysis of Fourier and Wavelet transforms, which are most commonly used as signal decomposition methods for audio processing tasks. Metrics for evaluating speech intelligibility are also discussed, namely Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI). The level of detail in the exposition is meant to be sufficient for an ML engineer to make informed decisions when choosing, fine-tuning, and evaluating a decomposition method for a specific ML model. The exposition contains mathematical definitions of the relevant concepts accompanied by intuitive non-mathematical explanations in order to make the text more accessible to engineers without deep expertise in signal processing. Formal mathematical definitions and proofs of theorems are intentionally omitted in order to keep the text concise.
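Of the three metrics named in the abstract, SI-SDR is the simplest to compute directly. The following is a minimal sketch and is not taken from the paper. It assumes the commonly used definition in which the estimate is projected onto the zero-mean reference, and the ratio of the projected energy to the residual energy is reported in decibels. The function name `si_sdr` and the toy signals are illustrative assumptions.

```python
import math


def si_sdr(reference, estimate):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch).

    Assumes both inputs are equal-length sequences of float samples.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    n = len(reference)
    if n == 0:
        raise ValueError("signals must not be empty")

    # Remove the mean so that a constant offset does not affect the score.
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]

    ref_energy = sum(r * r for r in ref)
    if ref_energy == 0.0:
        raise ValueError("reference has zero energy after mean removal")

    # Optimal scaling factor: projection of the estimate onto the reference.
    alpha = sum(e * r for e, r in zip(est, ref)) / ref_energy

    # Split the estimate into a scaled-reference part and a residual part.
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)

    if residual_energy == 0.0:
        return math.inf  # estimate is an exact scaled copy of the reference
    if target_energy == 0.0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / residual_energy)


# Toy example: a reference and a slightly distorted estimate (about 20 dB).
clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.1, -0.9, 0.9, -1.1]
score = si_sdr(clean, noisy)
```

Because the estimate is rescaled before the comparison, multiplying `noisy` by any nonzero constant leaves the score unchanged, which is the property the "scale-invariant" part of the name refers to. PESQ and STOI rely on perceptual models and are usually computed with dedicated implementations instead of a few lines of arithmetic.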
