Multisource AI Scorecard Table for System Evaluation
This work addresses the problem of inconsistent and opaque AI system evaluation for developers and users in both commercial and government sectors, offering an incremental step towards standardization.
This paper introduces the Multisource AI Scorecard Table (MAST), a standardized checklist derived from intelligence community analytic tradecraft principles (ICD 203) to evaluate AI/ML systems. It aims to foster more understandable systems and trust in AI outputs by assessing aspects like sourcing, uncertainty, consistency, accuracy, and visualization, and it illustrates the approach with three notional security use cases.
The paper describes a Multisource AI Scorecard Table (MAST) that provides the developer and user of an artificial intelligence (AI)/machine learning (ML) system with a standard checklist focused on the principles of good analysis adopted by the intelligence community (IC). The checklist is intended to promote the development of more understandable systems and engender trust in AI outputs. Such a scorecard enables a transparent, consistent, and meaningful understanding of AI tools applied for commercial and government use. A standard is built on compliance and agreement through policy, which requires buy-in from the stakeholders. While consistency in testing might exist only across a standard data set, the community still needs to discuss verification and validation approaches that can lead to interpretability, explainability, and proper use. The paper explores how the analytic tradecraft standards outlined in Intelligence Community Directive (ICD) 203 can provide a framework for assessing the performance of an AI system supporting various operational needs. The standards considered include sourcing, uncertainty, consistency, accuracy, and visualization. Three notional security use cases are presented for comparative analysis.
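To make the idea of a scorecard concrete, the following is a minimal sketch of how such a checklist could be represented and compared across systems. It is an illustration under stated assumptions, not the paper's implementation: only the five criteria named above are included, and the 0 to 3 rating scale, the `Scorecard` class, and the `comparison_rows` helper are hypothetical names and choices introduced here.

```python
from dataclasses import dataclass, field

# Criteria named in the abstract. This list may not be the full set used by MAST.
CRITERIA = ("sourcing", "uncertainty", "consistency", "accuracy", "visualization")

# Assumed rating scale for illustration only; the abstract does not specify one.
MIN_SCORE, MAX_SCORE = 0, 3


@dataclass
class Scorecard:
    """Ratings for one AI/ML system, keyed by criterion name."""
    system: str
    scores: dict = field(default_factory=dict)

    def rate(self, criterion: str, score: int) -> None:
        """Record a rating, rejecting unknown criteria and out-of-range scores."""
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        if not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"score must be in [{MIN_SCORE}, {MAX_SCORE}]")
        self.scores[criterion] = score

    def missing(self) -> list:
        """Criteria that have not been rated yet, in checklist order."""
        return [c for c in CRITERIA if c not in self.scores]

    def total(self) -> int:
        """Sum of the ratings recorded so far."""
        return sum(self.scores.values())


def comparison_rows(cards: list) -> list:
    """One (criterion, [score per system]) row per criterion; None if unrated."""
    return [(c, [card.scores.get(c) for card in cards]) for c in CRITERIA]
```

Rating each notional use case with its own `Scorecard` and passing the cards to `comparison_rows` yields a side-by-side table, which is one simple way to support the kind of comparative analysis the abstract mentions. Keeping unrated criteria visible as `None` rather than defaulting them to zero avoids conflating "not assessed" with "assessed as poor".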