CLAS
Dec 14, 2020

Towards localisation of keywords in speech using weak supervision

arXiv:2012.07396v1
5 citations
AI Analysis

This research addresses keyword localisation in low-resource speech settings, where full transcriptions are not available. It is an incremental step towards practical solutions for developers and researchers.

This paper explores keyword localisation in speech using two forms of weak supervision: bag-of-words (BoW) labels and visual context from image-paired utterances. The BoW-trained model performed better than the visually trained model, which sometimes located semantically related words, but not consistently.

Developments in weakly supervised and self-supervised models could enable speech technology in low-resource settings where full transcriptions are not available. We consider whether keyword localisation is possible using two forms of weak supervision where location information is not provided explicitly. In the first, only the presence or absence of a word is indicated, i.e. a bag-of-words (BoW) labelling. In the second, visual context is provided in the form of an image paired with an unlabelled utterance; a model then needs to be trained in a self-supervised fashion using the paired data. For keyword localisation, we adapt a saliency-based method typically used in the vision domain. We compare this to an existing technique that performs localisation as part of the network architecture. While the saliency-based method is more flexible (it can be applied without architectural restrictions), we identify a critical limitation when using it for keyword localisation. Of the two forms of supervision, the visually trained model performs worse than the BoW-trained model. We show qualitatively that the visually trained model sometimes locates semantically related words, but this is not consistent. While our results show that there is some signal allowing for localisation, they also call for other localisation methods better matched to these forms of weak supervision.
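
To make the idea of localisation without location labels concrete, the following is a minimal sketch and not the paper's model. It assumes a network has already produced one logit per frame and per keyword (the hypothetical `frame_scores`), and that these are max-pooled over time into an utterance-level score, which is the only quantity a BoW label can supervise. The frame that wins the pooling is then read off as the predicted location. The function name, the threshold, and the toy numbers are all illustrative assumptions.

```python
import math


def sigmoid(x):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def detect_and_localise(frame_scores, vocab, threshold=0.5):
    """Detect keywords and report the frame that triggered each detection.

    frame_scores[t][k] is a hypothetical logit for keyword k at frame t.
    Max-pooling over time gives one score per keyword for the utterance;
    the index of the maximum serves as the location estimate.
    """
    results = {}
    for k, word in enumerate(vocab):
        column = [frame[k] for frame in frame_scores]
        # Frame with the highest score for this keyword.
        best_frame = max(range(len(column)), key=column.__getitem__)
        prob = sigmoid(column[best_frame])
        if prob >= threshold:
            results[word] = (best_frame, prob)
    return results


# Toy utterance: 4 frames, 2 keywords (made-up numbers).
vocab = ["dog", "ball"]
frame_scores = [
    [-3.0, -2.0],
    [-1.0, -2.5],
    [2.0, -3.0],
    [0.5, -1.5],
]
hits = detect_and_localise(frame_scores, vocab)
print(hits)
```

In this toy input only "dog" passes the threshold, and it is placed at frame index 2. A saliency-based alternative would instead attribute the utterance-level score back to input frames after training, which is why it places no restriction on the architecture.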

