Learning Semantic Information from Raw Audio Signal Using Both Contextual and Phonetic Representations
This work addresses semantic learning from raw audio for speech processing applications, improving on methods that rely on a single type of representation.
The paper tackles the problem of learning semantics from raw audio signals by proposing a framework that uses both contextual and phonetic representations. On the sSIMI metric of the Zero Resource Speech Benchmark 2021 and on the Fluent Speech Commands dataset, the framework learns semantics better than models that use only one type of representation.
We propose a framework to learn semantics from raw audio signals using two types of representations, encoding contextual and phonetic information respectively. Specifically, we introduce a speech-to-unit processing pipeline that captures the two types of representations at different time resolutions. For the language model, we adopt a dual-channel architecture to incorporate both types of representation. We also present new training objectives, masked context reconstruction and masked context prediction, that push models to learn semantics effectively. Experiments on the sSIMI metric of the Zero Resource Speech Benchmark 2021 and on the Fluent Speech Commands dataset show that our framework learns semantics better than models trained with only one type of representation.
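To make the dual-channel idea concrete, the following is a minimal sketch of how two unit streams at different time resolutions might be aligned and how the contextual channel might be masked for a reconstruction-style objective. It is an illustration under stated assumptions, not the method as implemented in this work: the fixed integer ratio between resolutions, the choice of the contextual stream as the coarser one, the per-frame random masking, and the names `align_channels`, `mask_context`, and `MASK` are all hypothetical.

```python
import random

MASK = -1  # hypothetical mask id (assumption, not taken from the paper)


def align_channels(phonetic_units, contextual_units, ratio):
    """Repeat each coarse contextual unit `ratio` times so that both
    channels share the finer (phonetic) time resolution.

    Assumes the contextual stream is the coarser one and that the two
    resolutions differ by a fixed integer ratio.
    """
    upsampled = [u for u in contextual_units for _ in range(ratio)]
    # Truncate to a common length in case the streams do not line up exactly.
    n = min(len(phonetic_units), len(upsampled))
    return list(zip(phonetic_units[:n], upsampled[:n]))


def mask_context(pairs, mask_prob, rng):
    """Mask the contextual channel at random frames.

    Returns the masked two-channel sequence and, per frame, the original
    contextual unit to reconstruct (None where nothing was masked).
    """
    masked, targets = [], []
    for phon, ctx in pairs:
        if rng.random() < mask_prob:
            masked.append((phon, MASK))
            targets.append(ctx)
        else:
            masked.append((phon, ctx))
            targets.append(None)
    return masked, targets


# Toy unit sequences (made-up ids, for illustration only).
phonetic = [3, 3, 7, 7, 1, 1, 4, 4]   # one unit per fine-grained frame
contextual = [10, 11, 12, 13]          # one unit per two fine-grained frames

pairs = align_channels(phonetic, contextual, ratio=2)
masked, targets = mask_context(pairs, mask_prob=0.5, rng=random.Random(0))
```

In this sketch a dual-channel language model would consume `masked` as its two-channel input and be trained to recover the non-`None` entries of `targets`; the phonetic channel is left intact so that it can serve as evidence for the masked context.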