Design and evaluation of AI copilots -- case studies of retail copilot templates
This work addresses the problem of developing reliable, human-centered AI assistants for businesses, particularly in retail. It is incremental: it builds on existing practices without introducing major innovations.
The paper tackles the challenge of building effective AI copilots by presenting a systematic approach to design and evaluation. It uses retail-domain case studies from Microsoft to illustrate key components such as the LLM architecture and responsible AI guardrails, and it emphasizes the importance of testing for quality and safety.
Building a successful AI copilot requires a systematic approach. This paper is divided into two sections, covering the design and the evaluation of a copilot respectively. A case study of copilot templates developed by Microsoft for the retail domain illustrates the role and importance of each aspect. The first section explores the key technical components of a copilot's architecture, including the LLM, plugins for knowledge retrieval and actions, orchestration, system prompts, and responsible AI guardrails. The second section discusses testing and evaluation as a principled way to promote desired outcomes and manage unintended consequences when using AI in a business context. We discuss how to measure and improve a copilot's quality and safety through the lens of an end-to-end human-AI decision loop framework. By offering insights into the anatomy of a copilot and the critical aspects of testing and evaluation, this paper provides concrete evidence that good design and evaluation practices are essential for building effective, human-centered AI assistants.
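As a rough illustration of how the components named above might fit together, the following minimal sketch wires a system prompt, one knowledge-retrieval plugin, a keyword-based guardrail, and a simple orchestrator around a stubbed LLM. Every name here (`Copilot`, `retrieve_policy`, `guardrail_ok`, `stub_llm`), the blocked-term list, and the sample policy text are hypothetical and invented for this example. It is not the architecture of the Microsoft retail templates, and a real copilot would call an actual model and use far more robust safety checks.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# All names and data below are hypothetical, for illustration only.
SYSTEM_PROMPT = "You are a retail store assistant. Answer only from the provided context."
REFUSAL = "Sorry, I can't help with that request."
BLOCKED_TERMS = {"password", "credit card number"}  # toy guardrail list


def guardrail_ok(text: str) -> bool:
    """Toy responsible-AI check: reject text containing any blocked term."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def retrieve_policy(query: str) -> str:
    """Toy knowledge-retrieval plugin backed by an in-memory dictionary."""
    kb = {"return": "Items can be returned within 30 days with a receipt."}  # made-up data
    lowered = query.lower()
    for keyword, passage in kb.items():
        if keyword in lowered:
            return passage
    return ""


def stub_llm(system_prompt: str, user_message: str, context: str) -> str:
    """Stand-in for a real LLM call: echoes retrieved context, if any."""
    return context if context else "I don't have information on that."


@dataclass
class Copilot:
    """Orchestrator: input guardrail -> plugins -> LLM -> output guardrail."""
    llm: Callable[[str, str, str], str]
    plugins: Dict[str, Callable[[str], str]]
    system_prompt: str = SYSTEM_PROMPT

    def respond(self, user_message: str) -> str:
        if not guardrail_ok(user_message):
            return REFUSAL
        # Gather context from every plugin and drop empty results.
        parts = [plugin(user_message) for plugin in self.plugins.values()]
        context = "\n".join(part for part in parts if part)
        answer = self.llm(self.system_prompt, user_message, context)
        return answer if guardrail_ok(answer) else REFUSAL


copilot = Copilot(llm=stub_llm, plugins={"policy": retrieve_policy})
print(copilot.respond("What is your return policy?"))
```

Because the LLM is injected as a plain callable, the same orchestrator can be run against a fixed set of test prompts with a stub or a real model, which is one simple way to support the kind of testing and evaluation the second section describes.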