AS · CL · LG · Feb 8, 2024

Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model

arXiv:2402.05819v1 · 2 citations · h-index: 16 · 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
Originality: Incremental advance
AI Analysis

This work addresses the challenge of improving semantic understanding in speech models for SLU applications without costly speech-text paired data. It represents an incremental advance in training methodology.

Self-supervised speech models focus on frame-level objectives, which limits semantic comprehension in spoken language understanding (SLU) tasks. The paper tackles this problem by proposing PW-HuBERT, a framework that integrates pseudo word-level targets from a visually-grounded speech model without needing speech-text paired data. The authors report that the model is superior on four SLU benchmarks, though no concrete numbers are provided.

Recent advances in self-supervised speech models have shown significant improvement in many downstream tasks. However, these models are predominantly centered on frame-level training objectives, which can fall short in spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in real-world settings. To address this challenge, we propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process, where the targets are derived from a visually-grounded speech model, notably eliminating the need for speech-text paired data. Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
