Say What I Want: Towards the Dark Side of Neural Dialogue Models
This work addresses a security problem for chatbot services by revealing a weakness that could lead to misuse, though it is incremental in exploring a known vulnerability.
The paper tackles a security vulnerability of neural dialogue models: they can be manipulated into generating targeted outputs. It demonstrates that a reinforcement learning-based approach can craft inputs that elicit those outputs in a considerable portion of cases.
Neural dialogue models have been widely adopted in chatbot applications because of their good performance in simulating and generalizing human conversations. However, these models have a dark side: because neural networks are vulnerable to manipulation, users can steer a neural dialogue model into saying what they want, which raises concerns about the security of practical chatbot services. In this work, we investigate whether we can craft inputs that lead a well-trained black-box neural dialogue model to generate targeted outputs. We formulate this as a reinforcement learning (RL) problem and train a Reverse Dialogue Generator that efficiently finds such inputs for targeted outputs. Experiments conducted on a representative neural dialogue model show that our proposed model is able to discover such inputs in a considerable portion of cases. Overall, our work reveals this weakness of neural dialogue models and may prompt further research into solutions that mitigate it.
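The RL formulation can be illustrated with a minimal sketch. This is not the paper's Reverse Dialogue Generator: the toy vocabulary, the stand-in `black_box_reply` function, the token-overlap reward, and the position-wise categorical policy trained with REINFORCE are all assumptions chosen to keep the example small. The point is only the loop structure: sample an input, query the black box, score the output against the target, and update the policy from that score alone.

```python
import math
import random

# Hypothetical toy setup; none of these names come from the paper.
VOCAB = ["hi", "bye", "yes", "no", "ok"]
SEQ_LEN = 3


def black_box_reply(tokens):
    """Stand-in for the dialogue model: only inputs and outputs are observable."""
    return [VOCAB[(VOCAB.index(t) + 1) % len(VOCAB)] for t in tokens]


def reward(output, target):
    """Fraction of positions where the model output matches the target (assumed reward)."""
    return sum(o == t for o, t in zip(output, target)) / len(target)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_reverse_generator(target, steps=2000, lr=0.5, seed=0):
    """REINFORCE over an independent categorical policy per input position."""
    rng = random.Random(seed)
    logits = [[0.0] * len(VOCAB) for _ in range(SEQ_LEN)]
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Sample a candidate input from the current policy.
        idxs = [rng.choices(range(len(VOCAB)), weights=p)[0] for p in probs]
        # Query the black box; no gradients or internals are used.
        r = reward(black_box_reply([VOCAB[i] for i in idxs]), target)
        advantage = r - baseline
        baseline += 0.1 * (r - baseline)
        # Policy-gradient step: d log pi / d logit_k = 1[k == sampled] - prob_k.
        for pos, i in enumerate(idxs):
            for k in range(len(VOCAB)):
                grad = (1.0 if k == i else 0.0) - probs[pos][k]
                logits[pos][k] += lr * advantage * grad
    # Return the most likely input under the learned policy.
    return [VOCAB[max(range(len(VOCAB)), key=row.__getitem__)] for row in logits]
```

Calling `train_reverse_generator(["bye", "yes", "no"])` returns a candidate input whose quality can be checked with `reward(black_box_reply(candidate), target)`. Because training is stochastic, the sketch is not guaranteed to recover an exact match; a real attack would also need a sequence model as the policy and a reward suited to free-form text.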