AI · CL · LG · Dec 7, 2017

End-to-End Offline Goal-Oriented Dialog Policy Learning via Policy Gradient

Li Zhou, Kevin Small, Oleg Rokhlenko, Charles Elkan

arXiv:1712.02838v121.443 citations

Originality: Incremental advance

AI Analysis

This paper addresses myopic utterance generation in goal-oriented dialog systems trained on the large transcript datasets that companies accumulate, offering an incremental improvement over existing encoder-decoder and RL approaches.

The paper tackles goal-oriented dialog policy learning from unannotated corpora by proposing an offline reinforcement learning method that optimizes the policy at both the utterance and dialog levels, without requiring online user interaction or an explicit state-space definition.
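To make the utterance-level versus dialog-level distinction concrete, here is a minimal Python sketch of how per-turn returns could combine the two kinds of signal. The reward scheme is an assumption for illustration only and is not the paper's novel reward function: each agent turn receives a hypothetical utterance-level score, the final turn additionally receives a dialog-level reward, and the names `mixed_returns` and `gamma` are introduced here.

```python
from typing import List


def mixed_returns(utterance_rewards: List[float],
                  dialog_reward: float,
                  gamma: float = 0.9) -> List[float]:
    """Discounted return G_t for each agent turn t.

    Hypothetical scheme: every turn has its own utterance-level reward,
    and the last turn also receives a single dialog-level reward.
    """
    if not utterance_rewards:
        return []
    rewards = list(utterance_rewards)  # copy so the caller's list is untouched
    rewards[-1] += dialog_reward       # dialog-level signal arrives at the end
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, `mixed_returns([0.5, 0.25, 1.0], dialog_reward=1.0, gamma=0.5)` returns `[1.125, 1.25, 2.0]`. The dialog-level reward propagates back to earlier turns, which is what lets a policy trained on these returns look beyond the next utterance.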

Learning a goal-oriented dialog policy is generally performed offline with supervised learning algorithms or online with reinforcement learning (RL). Additionally, as companies accumulate massive quantities of dialog transcripts between customers and trained human agents, encoder-decoder methods have gained popularity as agent utterances can be directly treated as supervision without the need for utterance-level annotations. However, one potential drawback of such approaches is that they myopically generate the next agent utterance without regard for dialog-level considerations. To resolve this concern, this paper describes an offline RL method for learning from unannotated corpora that can optimize a goal-oriented policy at both the utterance and dialog level. We introduce a novel reward function and use both on-policy and off-policy policy gradient to learn a policy offline without requiring online user interaction or an explicit state space definition.
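The abstract mentions both on-policy and off-policy policy gradient. The sketch below shows, under stated assumptions, a generic REINFORCE-style gradient estimate for a linear softmax policy that scores a fixed set of candidate agent utterances, with an optional truncated importance weight rho = pi(a|s) / mu(a|s) for off-policy data. The candidate-ranking setup, the feature vectors, the availability of behavior probabilities `mu` in the logged transcripts, and the names `policy_gradient` and `max_weight` are all hypothetical; this is not the paper's exact estimator.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
# One logged decision: (feature vector per candidate utterance, index of the
# utterance actually chosen, behavior probability mu(a|s), return G_t).
# This data layout is a hypothetical illustration.
Step = Tuple[List[Vector], int, float, float]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(theta: Vector, features: List[Vector]) -> List[float]:
    # Linear softmax policy: logit(a) = theta . phi(s, a)
    logits = [sum(w * f for w, f in zip(theta, phi)) for phi in features]
    return softmax(logits)


def grad_log_prob(theta: Vector, features: List[Vector], action: int) -> Vector:
    # For a linear softmax: grad log pi(a|s) = phi(s,a) - E_{b~pi}[phi(s,b)]
    probs = policy_probs(theta, features)
    expected = [sum(p * phi[i] for p, phi in zip(probs, features))
                for i in range(len(theta))]
    return [features[action][i] - expected[i] for i in range(len(theta))]


def policy_gradient(theta: Vector, steps: List[Step],
                    on_policy: bool = False, max_weight: float = 10.0) -> Vector:
    """Average of rho * G_t * grad log pi(a_t|s_t) over logged steps."""
    if not steps:
        return [0.0] * len(theta)
    grad = [0.0] * len(theta)
    for features, action, mu, ret in steps:
        if on_policy:
            rho = 1.0  # data assumed to be sampled from the current policy
        else:
            pi_a = policy_probs(theta, features)[action]
            rho = min(pi_a / mu, max_weight)  # truncated importance weight
        g = grad_log_prob(theta, features, action)
        for i in range(len(theta)):
            grad[i] += rho * ret * g[i]
    return [x / len(steps) for x in grad]
```

With `theta = [0.0, 0.0]`, two one-hot candidates, chosen action 0, behavior probability 0.25 and return 2.0, the off-policy estimate is `[2.0, -2.0]` and the on-policy one is `[1.0, -1.0]`; the factor of two is the importance weight 0.5 / 0.25. A gradient-ascent update would then be `theta + lr * grad`. Truncating the weight at `max_weight` trades a little bias for lower variance, a common choice when logged transcripts come from a behavior policy that differs sharply from the learned one.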

