CL | SD | AS | Feb 27, 2023

Multimodal Speech Recognition for Language-Guided Embodied Agents

UW
arXiv:2302.14030v3 | 6 citations | h-index: 27 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of deploying embodied agents that receive spoken instructions. The advance is incremental, since it builds on existing ASR and embodied agent frameworks.

The paper tackles erroneous Automatic Speech Recognition (ASR) transcripts for language-guided embodied agents. It proposes a multimodal ASR model that uses visual context to improve transcription accuracy, recovering up to 30% more masked words than unimodal baselines and raising the task completion rate of a text-trained agent that follows the transcripts.

Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions. While Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous ASR transcripts can hurt the agents' ability to complete tasks. In this work, we propose training a multimodal ASR model to reduce errors in transcribing spoken instructions by considering the accompanying visual context. We train our model on a dataset of spoken instructions, synthesized from the ALFRED task completion dataset, where we simulate acoustic noise by systematically masking spoken words. We find that utilizing visual observations facilitates masked word recovery, with multimodal ASR models recovering up to 30% more masked words than unimodal baselines. We also find that a text-trained embodied agent successfully completes tasks more often by following transcribed instructions from multimodal ASR models.

Code: github.com/Cylumn/embodied-multimodal-asr
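The masking setup and the "masked word recovery" measure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper masks spoken words in synthesized audio, whereas this sketch masks text tokens as a stand-in. The names `MASK`, `mask_words`, and `recovery_rate` are hypothetical, and the position-aligned comparison assumes the transcript has the same number of words as the reference.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a word lost to noise


def mask_words(words, mask_rate, rng):
    """Mask round(mask_rate * len(words)) randomly chosen positions.

    Returns the masked word list and the sorted masked indices.
    """
    n_mask = round(mask_rate * len(words))
    masked_indices = sorted(rng.sample(range(len(words)), n_mask))
    chosen = set(masked_indices)
    masked = [MASK if i in chosen else w for i, w in enumerate(words)]
    return masked, masked_indices


def recovery_rate(reference, hypothesis, masked_indices):
    """Fraction of masked positions where the transcript matches the reference.

    Assumption: reference and hypothesis are aligned word for word.
    """
    if not masked_indices:
        return 1.0
    hits = sum(
        1
        for i in masked_indices
        if i < len(hypothesis) and hypothesis[i] == reference[i]
    )
    return hits / len(masked_indices)


if __name__ == "__main__":
    rng = random.Random(0)
    reference = "put the mug in the sink".split()
    masked, idx = mask_words(reference, 0.5, rng)
    print(masked, idx)
    # A perfect transcript recovers every masked word.
    print(recovery_rate(reference, reference, idx))
```

Comparing `recovery_rate` for a unimodal and a multimodal transcript of the same masked instruction would give the kind of relative improvement the abstract reports, under the alignment assumption stated above.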

Code Implementations: 1 repo