Siamese Capsule Network for End-to-End Speaker Recognition In The Wild
This work addresses speaker recognition in uncontrolled environments, offering a more data-efficient solution, though it appears incremental as it builds on existing capsule network and ResNet methods.
The paper tackles speaker verification in the wild by proposing an end-to-end deep model that combines thin-ResNet for embeddings and a Siamese capsule network with dynamic routing for similarity scoring, achieving state-of-the-art performance with less training data.
We propose an end-to-end deep model for speaker verification in the wild. Our model uses thin-ResNet to extract speaker embeddings from utterances, and a Siamese capsule network with dynamic routing as the Back-end to calculate a similarity score between the embeddings. We conduct a series of experiments comparing our model against state-of-the-art solutions, showing that it outperforms all the other models while using substantially less training data. We also perform additional experiments to study the impact of different speaker embeddings on the Siamese capsule network. We show that the best performance is achieved by taking embeddings directly from the feature aggregation module of the Front-end and passing them to higher capsules using dynamic routing.
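To make the Back-end idea concrete, the following is a minimal NumPy sketch of a Siamese capsule branch with routing-by-agreement. It is an illustration under stated assumptions, not the model evaluated in the paper: the capsule counts and dimensions, the random shared weights `W`, the reshaping of an embedding into 8 primary capsules, and the cosine comparison of the output capsules are all hypothetical choices, since the abstract does not specify how the final similarity score is computed. The Front-end (thin-ResNet with feature aggregation) is replaced here by random vectors.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink vector norms into [0, 1) while keeping their direction."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement.

    u_hat: prediction vectors, shape (n_in, n_out, d_out).
    Returns the higher-level capsules, shape (n_out, d_out).
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                      # routing logits
    for _ in range(num_iters):
        c = softmax(b, axis=1)                       # coupling coefficients
        s = np.einsum("ij,ijd->jd", c, u_hat)        # weighted sum per output capsule
        v = squash(s)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)    # reward agreeing predictions
    return v

def capsule_branch(embedding, W):
    """embedding: (n_in, d_in) primary capsules; W: (n_in, n_out, d_in, d_out)."""
    u_hat = np.einsum("id,ijde->ije", embedding, W)
    return dynamic_routing(u_hat)

def similarity(emb_a, emb_b, W):
    """Siamese scoring: both inputs go through the same weights W.

    Cosine similarity of the flattened output capsules is an assumption
    made for this sketch, not the scoring rule from the paper.
    """
    va = capsule_branch(emb_a, W).ravel()
    vb = capsule_branch(emb_b, W).ravel()
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9))

# Stand-ins for Front-end embeddings: 8 primary capsules of dimension 16 (hypothetical sizes).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 4, 16, 8))
emb_a = rng.normal(size=(8, 16))
emb_b = rng.normal(size=(8, 16))

out_caps = capsule_branch(emb_a, W)
score_same = similarity(emb_a, emb_a, W)
score_diff = similarity(emb_a, emb_b, W)
```

Because both utterances pass through the same `W`, an embedding compared with itself scores close to 1, while two unrelated embeddings score somewhere in [-1, 1]. In a trained system the shared weights would be learned jointly with the Front-end rather than drawn at random.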