AI | CL | LG | Dec 7, 2017

End-to-End Offline Goal-Oriented Dialog Policy Learning via Policy Gradient

arXiv:1712.02838v1 | 43 citations
Originality: Incremental advance
AI Analysis

This paper addresses myopic utterance generation in goal-oriented dialog systems trained on the large transcript datasets that companies accumulate, offering an incremental improvement over existing encoder-decoder and RL approaches.

The paper tackles goal-oriented dialog policy learning from unannotated corpora by proposing an offline reinforcement learning method that optimizes the policy at both the utterance and dialog levels, without requiring online user interaction or an explicit state space definition.

Learning a goal-oriented dialog policy is generally performed offline with supervised learning algorithms or online with reinforcement learning (RL). Additionally, as companies accumulate massive quantities of dialog transcripts between customers and trained human agents, encoder-decoder methods have gained popularity as agent utterances can be directly treated as supervision without the need for utterance-level annotations. However, one potential drawback of such approaches is that they myopically generate the next agent utterance without regard for dialog-level considerations. To resolve this concern, this paper describes an offline RL method for learning from unannotated corpora that can optimize a goal-oriented policy at both the utterance and dialog level. We introduce a novel reward function and use both on-policy and off-policy policy gradient to learn a policy offline without requiring online user interaction or an explicit state space definition.
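To make the off-policy half of that idea concrete, here is a minimal sketch of an importance-weighted policy gradient update computed purely from logged data. It is not the paper's method: the abstract does not specify the reward function or the model, so the tabular softmax policy over three candidate utterances, the `LOGGED` tuples, the weight clipping, and all function names are assumptions chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def off_policy_gradient(logits, logged, clip=5.0):
    """Importance-weighted REINFORCE gradient with respect to the logits.

    logged: list of (action_index, behavior_prob, dialog_return) tuples,
    where behavior_prob is the probability the logging policy gave the action.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, behavior_prob, ret in logged:
        # Ratio between the current policy and the logging policy,
        # clipped (an assumption here) to limit variance.
        weight = min(probs[action] / behavior_prob, clip)
        for k in range(len(logits)):
            # Derivative of log softmax(logits)[action] w.r.t. logit k.
            dlogp = (1.0 if k == action else 0.0) - probs[k]
            grad[k] += weight * ret * dlogp
    return [g / len(logged) for g in grad]


def train(logged, num_actions, steps=200, lr=0.5):
    """Gradient ascent on the logged data only; no online interaction."""
    logits = [0.0] * num_actions
    for _ in range(steps):
        grad = off_policy_gradient(logits, logged)
        logits = [l + lr * g for l, g in zip(logits, grad)]
    return softmax(logits)


# Hypothetical log: three candidate agent utterances chosen uniformly at
# random by the logging policy, each paired with a made-up dialog-level return.
LOGGED = [
    (0, 1 / 3, 0.0),
    (1, 1 / 3, 1.0),
    (2, 1 / 3, 0.2),
    (1, 1 / 3, 1.0),
    (0, 1 / 3, 0.1),
]

policy = train(LOGGED, num_actions=3)
```

In this toy setting the learned `policy` shifts probability toward the utterance whose logged dialogs earned the highest return. A real system would replace the table with an encoder-decoder conditioned on the dialog history.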

Code Implementations: 1 repo
