Child Speech Recognition in Human-Robot Interaction: Problem Solved?
This paper addresses the challenge of enabling effective speech-based human-robot interaction for children. The contribution is incremental, as it builds on existing data-driven advances in speech recognition.
The paper tackles poor automated speech recognition of children's speech, which hinders child-robot interaction. It shows that recent models such as OpenAI Whisper outperform leading commercial cloud services, and that the best model recognises 60.3% of sentences correctly, barring small grammatical differences, with sub-second transcription time on a local GPU. This indicates potential for usable autonomous child-robot speech interactions.
Automated Speech Recognition shows superhuman performance for adult English speech on a range of benchmarks, but disappoints when fed children's speech. This has long stood in the way of child-robot interaction. Recent advances in data-driven speech recognition, including the availability of Transformer architectures and unprecedented volumes of training data, might mean a breakthrough for child speech recognition and social robot applications aimed at children. We revisit a study on child speech recognition from 2017 and show that performance has indeed improved, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. Performance improves even more in highly structured interactions when priming models with specific phrases. While transcription is not perfect yet, the best model recognises 60.3% of sentences correctly, barring small grammatical differences, with sub-second transcription time running on a local GPU, showing potential for usable autonomous child-robot speech interactions.
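To make the sentence-level accuracy figure concrete, the following is a minimal sketch of how such a score could be computed. It is not the authors' evaluation code: the function names and the example sentences are hypothetical, and the normalisation used here (ignoring case, punctuation, and extra whitespace) is only an assumed stand-in for the paper's criterion of "small grammatical differences", which the abstract does not define.

```python
import string

# Translation table that deletes ASCII punctuation (assumed normalisation rule).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(sentence: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    stripped = sentence.lower().translate(_PUNCT_TABLE)
    return " ".join(stripped.split())


def sentence_accuracy(references: list[str], hypotheses: list[str]) -> float:
    """Fraction of transcripts that match their reference after normalisation."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        return 0.0
    matches = sum(
        normalize(ref) == normalize(hyp)
        for ref, hyp in zip(references, hypotheses)
    )
    return matches / len(references)


if __name__ == "__main__":
    # Hypothetical reference sentences and ASR outputs, for illustration only.
    refs = ["The robot is red.", "I like apples"]
    hyps = ["the robot is red", "I like apple"]
    print(f"Sentence-level accuracy: {sentence_accuracy(refs, hyps):.1%}")
```

Unlike word error rate, this metric gives no partial credit: a transcript with a single wrong word counts as a miss, which is a strict but practical measure when a robot must act on the whole utterance.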