CL · Nov 22, 2017

Customized Nonlinear Bandits for Online Response Selection in Neural Conversation Models

arXiv:1711.08493v1 · 33 citations
Originality: Incremental advance
AI Analysis

This paper addresses online learning in dialog systems, which lets conversational agents improve as they interact with users. The advance is incremental, since it builds on existing bandit and neural methods.

The paper tackles online response selection in retrieval-based dialog systems by proposing a contextual multi-armed bandit with a nonlinear reward function using distributed text representations, achieving significant performance gains over linear contextual bandits on the Ubuntu Dialogue Corpus.

Dialog response selection is an important step towards natural response generation in conversational agents. Existing work on neural conversational models mainly focuses on offline supervised learning using a large set of context-response pairs. In this paper, we focus on online learning of response selection in retrieval-based dialog systems. We propose a contextual multi-armed bandit model with a nonlinear reward function that uses distributed representation of text for online response selection. A bidirectional LSTM is used to produce the distributed representations of dialog context and responses, which serve as the input to a contextual bandit. In learning the bandit, we propose a customized Thompson sampling method that is applied to a polynomial feature space in approximating the reward. Experimental results on the Ubuntu Dialogue Corpus demonstrate significant performance gains of the proposed method over conventional linear contextual bandits. Moreover, we report encouraging response selection performance of the proposed neural bandit model using the Recall@k metric for a small set of online training samples.
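To make the bandit idea in the abstract concrete, here is a minimal sketch of Thompson sampling over a polynomial feature space. It is not the paper's customized method: it uses a standard Bayesian linear regression posterior on degree-2 monomials, and it assumes each candidate response has already been turned into a small joint context-response vector (in the paper these would come from a bidirectional LSTM). The names `poly_features` and `PolyThompsonBandit`, and the parameters `prior_var` and `noise_var`, are hypothetical and chosen for illustration only.

```python
from itertools import combinations_with_replacement

import numpy as np


def poly_features(x, degree=2):
    """Map x to all monomials of its entries up to `degree`, plus a bias term."""
    feats = [1.0]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            feats.append(float(np.prod(x[list(idx)])))
    return np.array(feats)


class PolyThompsonBandit:
    """Gaussian Thompson sampling on polynomial features (illustrative only)."""

    def __init__(self, dim, degree=2, prior_var=1.0, noise_var=0.25, seed=0):
        self.degree = degree
        p = len(poly_features(np.zeros(dim), degree))
        self.A = np.eye(p) / prior_var  # posterior precision matrix
        self.b = np.zeros(p)            # precision-weighted reward sum
        self.noise_var = noise_var
        self.rng = np.random.default_rng(seed)

    def select(self, candidates):
        """Sample weights from the posterior and pick the best-scoring arm."""
        cov = np.linalg.inv(self.A)
        cov = (cov + cov.T) / 2.0       # guard against numerical asymmetry
        mean = cov @ self.b
        w = self.rng.multivariate_normal(mean, cov)
        scores = [w @ poly_features(c, self.degree) for c in candidates]
        return int(np.argmax(scores))

    def update(self, x, reward):
        """Fold one observed (features, reward) pair into the posterior."""
        phi = poly_features(x, self.degree)
        self.A += np.outer(phi, phi) / self.noise_var
        self.b += phi * reward / self.noise_var


if __name__ == "__main__":
    # Toy loop: three candidate "responses" per round, with a made-up
    # nonlinear reward standing in for user feedback.
    env = np.random.default_rng(1)
    bandit = PolyThompsonBandit(dim=2)
    for _ in range(200):
        candidates = [env.normal(size=2) for _ in range(3)]
        arm = bandit.select(candidates)
        x = candidates[arm]
        reward = float(x[0] * x[1] > 0)  # hypothetical reward signal
        bandit.update(x, reward)
```

Because the reward model is linear in the expanded features, the posterior stays Gaussian and can be updated in closed form, while the monomials let the expected reward be nonlinear in the original vector. The number of features grows quickly with the input dimension, so a sketch like this is only practical for low-dimensional inputs or after dimensionality reduction.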

