Multiclass Language Identification using Deep Learning on Spectral Images of Audio Signals
This work addresses the need for efficient language detection in voice recognition systems, though the contribution is incremental: it applies existing deep learning methods to a specific domain.
The paper tackles the problem of automated language identification from audio signals by using spectrograms as inputs to a convolutional neural network, achieving 97% accuracy for binary classification and 89% for six-language classification on 3.75-second clips.
The first step in any voice recognition software is to determine what language a speaker is using, and ideally this process would be automated. The technique described in this paper, language identification for audio spectrograms (LIFAS), uses spectrograms generated from audio signals as inputs to a convolutional neural network (CNN) that identifies the language. LIFAS requires minimal pre-processing of the audio signals, because the spectrograms are generated for each batch as it is fed to the network during training. LIFAS takes deep learning tools that have been shown to be successful on image processing tasks and applies them to audio signal classification. LIFAS performs binary language classification with an accuracy of 97%, and multi-class classification with six languages with an accuracy of 89%, on 3.75-second audio clips.
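The per-batch spectrogram generation described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sampling rate, window length, hop length, log scaling, and the function names `log_spectrogram` and `spectrogram_batches` are all assumptions made for illustration, and only the 3.75-second clip length comes from the text.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed sampling rate in Hz (not stated in the text)
CLIP_SECONDS = 3.75    # clip length taken from the text
FRAME_LEN = 400        # assumed 25 ms analysis window at 16 kHz
HOP_LEN = 160          # assumed 10 ms hop at 16 kHz


def log_spectrogram(signal, frame_len=FRAME_LEN, hop_len=HOP_LEN):
    """Return a log-magnitude spectrogram of shape (freq_bins, frames)."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Index matrix that slices the signal into overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # log1p compresses the dynamic range; transpose so frequency is the first axis.
    return np.log1p(magnitude).T


def spectrogram_batches(clips, labels, batch_size):
    """Yield (images, labels) batches, computing spectrograms on the fly."""
    for start in range(0, len(clips), batch_size):
        batch = clips[start:start + batch_size]
        images = np.stack([log_spectrogram(clip) for clip in batch])
        # Trailing channel axis so each spectrogram resembles a grayscale image.
        yield images[..., None], labels[start:start + batch_size]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_samples = int(SAMPLE_RATE * CLIP_SECONDS)
    clips = rng.standard_normal((5, n_samples))   # synthetic stand-in for audio
    labels = np.arange(5) % 2                     # hypothetical binary labels
    for images, batch_labels in spectrogram_batches(clips, labels, batch_size=2):
        print(images.shape, batch_labels)
```

Because the raw clips are stored and the image is computed only when a batch is assembled, no spectrogram files need to be written to disk beforehand. Under the assumed settings, each 3.75-second clip becomes a 201 x 373 single-channel image that a CNN could consume like any other grayscale input.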