Query-by-example Spoken Term Detection using Attention-based Multi-hop Networks
This paper addresses query-by-example spoken term detection for speech retrieval, reporting gains from attention and multiple hops under supervised training and lower search time complexity under unsupervised training.
The paper tackles the problem of retrieving spoken content using spoken queries without transcription, proposing an end-to-end model based on an attention-based multi-hop network that outputs whether an audio segment includes the query. In the supervised scenario, attention and multiple hops improve performance; in the unsupervised setting, the model matches the performance of an existing query-by-example system at a lower search time complexity.
Retrieving spoken content with spoken queries, or query-by-example spoken term detection (STD), is attractive because it makes possible the matching of signals directly on the acoustic level without transcribing them into text. Here, we propose an end-to-end query-by-example STD model based on an attention-based multi-hop network, whose input is a spoken query and an audio segment containing several utterances; the output states whether the audio segment includes the query. The model can be trained in either a supervised scenario using labeled data, or in an unsupervised fashion. In the supervised scenario, we find that the attention mechanism and multiple hops improve performance, and that the attention weights indicate the time span of the detected terms. In the unsupervised setting, the model mimics the behavior of the existing query-by-example STD system, yielding performance comparable to the existing system but with a lower search time complexity.
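To make the idea of attention over multiple hops concrete, the following is a minimal sketch, not the paper's architecture. It assumes the query has already been encoded as a single vector and the audio segment as a list of frame vectors; the dot-product scoring, the additive query update, and the names `multi_hop_attention`, `query_vec`, `frame_vecs`, and `hops` are all illustrative assumptions. The learned encoders and the final classifier that decides whether the segment contains the query are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def multi_hop_attention(query_vec, frame_vecs, hops=2):
    """Attend over frame_vecs for several hops (hypothetical sketch).

    Each hop scores every frame against the current query vector,
    normalizes the scores into attention weights, and adds the
    weighted sum of frames back into the query (assumed update rule).
    Returns the final query vector and the weights from every hop.
    """
    q = list(query_vec)
    weights_per_hop = []
    for _ in range(hops):
        weights = softmax([dot(q, frame) for frame in frame_vecs])
        context = [
            sum(w * frame[d] for w, frame in zip(weights, frame_vecs))
            for d in range(len(q))
        ]
        q = [qi + ci for qi, ci in zip(q, context)]
        weights_per_hop.append(weights)
    return q, weights_per_hop


# Toy data: frame 2 resembles the query, the other frames do not.
query = [1.0, 0.0]
frames = [[0.0, 1.0], [0.1, 0.9], [2.0, 0.0], [0.0, 1.0]]
final_vec, attn = multi_hop_attention(query, frames, hops=2)
best_frame = max(range(len(frames)), key=lambda t: attn[-1][t])
print("most attended frame:", best_frame)
```

In this toy setting the attention weights concentrate on the frame that resembles the query, which loosely mirrors the abstract's observation that attention weights indicate the time span of the detected term. In a trained model the final vector would feed a classifier rather than being inspected directly.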