Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs
This paper addresses the challenge of designing fluid spoken dialog systems by incorporating multimodal cues, though its method appears incremental.
The paper tackles the problem of coordinating turn-taking in conversational interactions. It proposes a multiscale RNN architecture that models linguistic, acoustic, and gaze features at separate temporal rates, and reports that modeling modalities at separate rates can benefit turn-taking modeling.
In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions, it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which each modality should be modeled. We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner. Our results show that modeling linguistic and acoustic features at separate temporal rates can be beneficial for turn-taking modeling. We also show that our approach can be used to incorporate gaze features into turn-taking models.
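To make the idea of modeling modalities at separate timescales concrete, the following is a minimal sketch, not the architecture evaluated in the paper. It assumes two input streams, one sampled `ratio` times more often than the other (for instance, acoustic frames at the fast rate and word-level features at the slow rate, which is an assumption here). Each stream has its own plain tanh recurrent cell, the slow hidden state is held constant between its updates, and a third "master" cell fuses both states at the fast rate to emit a per-frame probability. All names (`rnn_step`, `init_params`, `multiscale_forward`), the cell type, the sizes, and the random weights are hypothetical and chosen only for illustration.

```python
import numpy as np


def rnn_step(x, h, W_x, W_h, b):
    """One step of a plain tanh recurrent cell (illustrative, not an LSTM)."""
    return np.tanh(x @ W_x + h @ W_h + b)


def init_cell(rng, in_dim, hid_dim):
    """Random weights for one recurrent cell; untrained, for shape checking only."""
    return (rng.normal(0.0, 0.1, (in_dim, hid_dim)),
            rng.normal(0.0, 0.1, (hid_dim, hid_dim)),
            np.zeros(hid_dim))


def init_params(rng, d_fast, d_slow, hid):
    """Hypothetical parameter set: fast cell, slow cell, master cell, output layer."""
    fast_cell = init_cell(rng, d_fast, hid)
    slow_cell = init_cell(rng, d_slow, hid)
    master_cell = init_cell(rng, 2 * hid, hid)  # input is [h_fast, h_slow]
    out = (rng.normal(0.0, 0.1, hid), 0.0)      # weights and bias of a sigmoid unit
    return fast_cell, slow_cell, master_cell, out


def multiscale_forward(fast_feats, slow_feats, ratio, params):
    """Run both streams and return one probability per fast-rate frame.

    fast_feats: array of shape (T, d_fast)
    slow_feats: array of shape (ceil(T / ratio), d_slow); row k is assumed
                to be available at fast frame k * ratio.
    """
    fast_cell, slow_cell, master_cell, (w_out, b_out) = params
    hid = fast_cell[2].shape[0]
    h_fast = np.zeros(hid)
    h_slow = np.zeros(hid)
    h_master = np.zeros(hid)
    probs = []
    for t in range(fast_feats.shape[0]):
        # The slow cell updates only once every `ratio` fast frames;
        # in between, its last hidden state is reused unchanged.
        if t % ratio == 0:
            h_slow = rnn_step(slow_feats[t // ratio], h_slow, *slow_cell)
        h_fast = rnn_step(fast_feats[t], h_fast, *fast_cell)
        fused = np.concatenate([h_fast, h_slow])
        h_master = rnn_step(fused, h_master, *master_cell)
        logit = h_master @ w_out + b_out
        probs.append(1.0 / (1.0 + np.exp(-logit)))
    return np.array(probs)


# Usage with random stand-in features: 20 fast frames, 4 slow steps.
rng = np.random.default_rng(0)
params = init_params(rng, d_fast=6, d_slow=3, hid=8)
fast = rng.normal(size=(20, 6))
slow = rng.normal(size=(4, 3))
probs = multiscale_forward(fast, slow, ratio=5, params=params)
print(probs.shape)
```

The design point the sketch illustrates is that each stream is processed at its own rate before fusion, rather than resampling every feature to a single common rate first. A trained system would replace the random weights with learned ones and would need a training objective, which this sketch omits.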