Personalized Chatbot Trustworthiness Ratings
This work addresses the need for users, developers, and regulators to evaluate and compare chatbots based on personalized trust criteria, though it is incremental as it builds on existing rating and aggregation methods.
The paper tackles the problem of assessing chatbot trustworthiness without modifying the chatbot or accessing its training data, proposing a personalized rating methodology that aggregates separate trust issue ratings based on user priorities, and demonstrates its application on four dialog datasets with user surveys.
Conversation agents, commonly referred to as chatbots, are increasingly deployed in many domains to allow people to have a natural interaction while trying to solve a specific problem. Given their widespread use, it is important to provide their users with methods and tools to increase users awareness of various properties of the chatbots, including non-functional properties that users may consider important in order to trust a specific chatbot. For example, users may want to use chatbots that are not biased, that do not use abusive language, that do not leak information to other users, and that respond in a style which is appropriate for the user's cognitive level. In this paper, we address the setting where a chatbot cannot be modified, its training data cannot be accessed, and yet a neutral party wants to assess and communicate its trustworthiness to a user, tailored to the user's priorities over the various trust issues. Such a rating can help users choose among alternative chatbots, developers test their systems, business leaders price their offering, and regulators set policies. We envision a personalized rating methodology for chatbots that relies on separate rating modules for each issue, and users' detected priority orderings among the relevant trust issues, to generate an aggregate personalized rating for the trustworthiness of a chatbot. The method is independent of the specific trust issues and is parametric to the aggregation procedure, thereby allowing for seamless generalization. We illustrate its general use, integrate it with a live chatbot, and evaluate it on four dialog datasets and representative user profiles, validated with user surveys.