One-Shot Speaker Identification for a Service Robot using a CNN-based Generic Verifier
This work addresses the need for efficient speaker identification in service robotics, though its contribution is incremental: it adapts existing verification methods to a specific domain.
The paper tackles speaker identification for service robots in settings where new users appear frequently, by developing a Siamese CNN-based verifier that enables one-shot learning without retraining. The system is evaluated for verifier performance, classification performance, speed, and real-life accuracy, and the results indicate that it is a viable alternative for speaker identification on service robots.
In service robotics, there is interest in identifying the user by voice alone. However, in application scenarios where a service robot acts as a waiter or a store clerk, new users are expected to enter the environment frequently. Typically, speaker identification models need to be retrained when this occurs, which can take an impractical amount of time. In this paper, a new approach to speaker identification through verification is developed using a Siamese Convolutional Neural Network (SCNN) architecture, which learns to verify generically whether two audio signals come from the same speaker. Given an external database of recorded audio of the users, identification is carried out by verifying the speech input against each database entry. When a new user is encountered, it is only necessary to add their recorded audio to the external database for them to be identified; no retraining is required. The system was evaluated in four different aspects: the performance of the verifier, the performance of the system as a classifier using clean audio, its speed, and its accuracy in real-life settings. Its performance, in conjunction with its one-shot-learning capabilities, makes the proposed system a viable alternative for speaker identification for service robots.
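The identification-through-verification loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: `cosine_verifier` is a hypothetical stand-in for the trained SCNN, the short feature vectors stand in for recorded audio, and the `threshold` parameter and the choice of the best-scoring entry are assumptions made only for illustration.

```python
import math
from typing import Callable, Dict, Optional, Sequence

Signal = Sequence[float]  # placeholder for an audio recording or its features


def cosine_verifier(a: Signal, b: Signal) -> float:
    """Hypothetical stand-in for the SCNN: a same-speaker score in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return max(0.0, dot / norm)


def identify(
    query: Signal,
    database: Dict[str, Signal],
    verifier: Callable[[Signal, Signal], float] = cosine_verifier,
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> Optional[str]:
    """Verify the query against every database entry and return the best match.

    Returns None when no entry scores above the threshold (unknown speaker).
    """
    best_name, best_score = None, threshold
    for name, reference in database.items():
        score = verifier(query, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# External database of enrolled users (toy feature vectors).
database = {"alice": [1.0, 0.0, 0.2], "bob": [0.0, 1.0, 0.1]}

known = identify([0.9, 0.1, 0.2], database)
unknown = identify([0.0, 0.0, 1.0], database)

# Enrolling a new user only adds an entry; the verifier is left untouched.
database["carol"] = [0.0, 0.1, 1.0]
enrolled = identify([0.0, 0.0, 1.0], database)
```

In this sketch, enrolling a new user is a single dictionary insertion, which mirrors the property that no retraining is needed. The cost of each identification grows linearly with the number of database entries, since the verifier is called once per entry.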