Reasoning About Pragmatics with Neural Listeners and Speakers
This paper addresses the challenge of generating context-aware language for human-computer interaction, and represents an incremental improvement over prior methods.
The paper tackles pragmatic scene description by combining learned semantics with inference-driven pragmatics; in human evaluations on a referring expression game, the approach succeeds 81% of the time, compared to 69% for existing techniques.
We present a model for pragmatically describing scenes, in which contrastive behavior results from a combination of inference-driven pragmatics and learned semantics. Like previous learned approaches to language generation, our model uses a simple feature-driven architecture (here a pair of neural "listener" and "speaker" models) to ground language in the world. Like inference-driven approaches to pragmatics, our model actively reasons about listener behavior when selecting utterances. For training, our approach requires only ordinary captions, annotated _without_ demonstration of the pragmatic behavior the model ultimately exhibits. In human evaluations on a referring expression game, our approach succeeds 81% of the time, compared to a 69% success rate using existing techniques.
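To make the idea of "reasoning about listener behavior when selecting utterances" concrete, here is a minimal toy sketch in Python. It is not the paper's neural architecture: scenes are reduced to sets of attribute words, and the function names, the trade-off weight `lam`, and the smoothing constant `eps` are all hypothetical choices made for illustration. The sketch assumes a candidate-and-rerank scheme, in which a literal speaker proposes utterances and a literal listener scores how reliably each one picks out the target.

```python
from typing import Dict, List, Set

# Hypothetical toy scenes: each scene is a set of visible attributes.
TARGET: Set[str] = {"sun", "tree", "dog"}
DISTRACTOR: Set[str] = {"sun", "tree", "cat"}


def literal_speaker(scene: Set[str]) -> Dict[str, float]:
    """Toy stand-in for a learned speaker: uniform over the scene's attributes."""
    return {word: 1.0 / len(scene) for word in sorted(scene)}


def literal_listener(utterance: str, scenes: List[Set[str]], index: int,
                     eps: float = 1e-3) -> float:
    """Toy stand-in for a learned listener: probability of choosing scenes[index].

    A scene that contains the word gets weight 1, otherwise a small weight eps.
    """
    weights = [1.0 if utterance in scene else eps for scene in scenes]
    return weights[index] / sum(weights)


def reasoning_speaker(target: Set[str], distractor: Set[str],
                      lam: float = 0.1) -> str:
    """Pick the candidate that balances fluency against listener success.

    score(u) = p_speaker(u | target)^lam * p_listener(target | u)^(1 - lam)
    (an assumed scoring rule for this sketch).
    """
    candidates = literal_speaker(target)
    scenes = [target, distractor]

    def score(utterance: str) -> float:
        fluency = candidates[utterance] ** lam
        success = literal_listener(utterance, scenes, 0) ** (1.0 - lam)
        return fluency * success

    return max(candidates, key=score)


if __name__ == "__main__":
    print(reasoning_speaker(TARGET, DISTRACTOR))
```

In this toy setting the words "sun" and "tree" are true of both scenes, so the listener is at chance when it hears them; only "dog" distinguishes the target, and the reasoning speaker selects it. Note that neither toy component was given examples of contrastive descriptions, which mirrors the abstract's point that the pragmatic behavior comes from inference rather than from the training annotations.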