RAISE: A Unified Framework for Responsible AI Scoring and Evaluation
The work addresses the need for multi-dimensional evaluation in high-stakes domains such as finance and healthcare, though its contribution is incremental because it builds on existing responsible AI concepts.
The paper tackles the problem of evaluating AI models beyond predictive accuracy. It introduces RAISE, a unified framework that quantifies performance across explainability, fairness, robustness, and sustainability. It finds that no single model dominates across all criteria and that the models trade off against one another: the Transformer, for example, excels in explainability and fairness but at a high environmental cost.
As AI systems enter high-stakes domains, evaluation must extend beyond predictive accuracy to include explainability, fairness, robustness, and sustainability. We introduce RAISE (Responsible AI Scoring and Evaluation), a unified framework that quantifies model performance across these four dimensions and aggregates them into a single, holistic Responsibility Score. We evaluated three deep learning models, namely a Multilayer Perceptron (MLP), a Tabular ResNet, and a Feature Tokenizer Transformer, on structured datasets from finance, healthcare, and socioeconomics. Our findings reveal critical trade-offs: the MLP demonstrated strong sustainability and robustness, the Transformer excelled in explainability and fairness at a very high environmental cost, and the Tabular ResNet offered a balanced profile. These results underscore that no single model dominates across all responsibility criteria, highlighting the necessity of multi-dimensional evaluation for responsible model selection. Our implementation is available at: https://github.com/raise-framework/raise.
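To make the idea of aggregating four dimensions into one score concrete, the following is a minimal sketch of one possible aggregation. The abstract does not state the formula RAISE uses, so everything here is an assumption for illustration: the names `DimensionScores` and `responsibility_score` are hypothetical, each dimension is assumed to be already normalized to [0, 1] with higher meaning better, and the aggregation is assumed to be a weighted arithmetic mean with equal weights by default. The actual framework may normalize and combine the dimensions differently.

```python
from dataclasses import dataclass, asdict
from typing import Dict, Optional


@dataclass(frozen=True)
class DimensionScores:
    """Hypothetical container: one score per dimension, assumed in [0, 1]."""
    explainability: float
    fairness: float
    robustness: float
    sustainability: float


def responsibility_score(
    scores: DimensionScores,
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted mean of the four dimension scores.

    ASSUMPTION: a weighted arithmetic mean stands in for the aggregation;
    the source does not specify how RAISE combines the dimensions.
    """
    values = asdict(scores)
    if weights is None:
        # Equal weighting is an assumption, not a documented default.
        weights = {name: 1.0 for name in values}
    if set(weights) != set(values):
        raise ValueError("weights must name exactly the four dimensions")
    for name, value in values.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(weights[name] * values[name] for name in values) / total


if __name__ == "__main__":
    # Placeholder values for illustration only; not results from the paper.
    example = DimensionScores(
        explainability=0.8, fairness=0.7, robustness=0.6, sustainability=0.2
    )
    print(responsibility_score(example))
```

A single aggregate of this kind is convenient for ranking, but it can hide exactly the trade-offs the abstract reports, such as strong explainability offset by poor sustainability, so the per-dimension scores are worth reporting alongside it.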