LG · Apr 15, 2024

Application of the representative measure approach to assess the reliability of decision trees in dealing with unseen vehicle collision data

Javier Perera-Lago, Víctor Toscano-Durán, Eduardo Paluzo-Hidalgo, Sara Narteni, Matteo Rucco

arXiv:2404.09541v12.61 citations · h-index: 3 · Has Code · xAI

Originality: Synthesis-oriented

AI Analysis

This work addresses reliability in AI for vehicle safety applications, but it is incremental as it extends an existing method to a specific domain.

The paper studies how to assess dataset similarity for decision trees with the ε-representativeness method. It provides a theoretical guarantee that predictions are similar when two datasets are ε-representative, shows experimentally that ε-representativeness correlates with the ordering of feature importance, and extends the experiments to unseen vehicle collision data with XGBoost.

Machine learning algorithms are fundamental components of novel data-informed Artificial Intelligence architectures. In this domain, representative datasets are a cornerstone of artificial intelligence (AI) development, because they are needed to train machine learning components properly. Proper training has multiple impacts: it reduces the final model's complexity, power, and uncertainties. In this paper, we investigate, from a theoretical perspective, the reliability of the $\varepsilon$-representativeness method for assessing dataset similarity in the case of decision trees. We focus on the family of decision trees because it includes a wide variety of models known to be explainable. We provide a result guaranteeing that if two datasets are related by $\varepsilon$-representativeness, i.e., every point of one has a point of the other closer than $\varepsilon$, then the predictions by the classic decision tree are similar. Experimentally, we have also tested that $\varepsilon$-representativeness presents a significant correlation with the ordering of the feature importance. Moreover, we extend the results experimentally in the context of unseen vehicle collision data for XGBoost, a machine-learning component widely adopted for dealing with tabular data.
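To make the notion of $\varepsilon$-representativeness in the abstract concrete, here is a minimal sketch that computes the smallest $\varepsilon$ for which two small point sets cover each other. It is an illustration under stated assumptions, not the authors' implementation: it assumes Euclidean distance, ignores class labels, and the function names `directed_epsilon` and `mutual_epsilon` are hypothetical.

```python
import math


def directed_epsilon(source, target):
    """Smallest eps such that every point of `source` has some point
    of `target` within distance eps (Euclidean distance is an assumption)."""
    return max(min(math.dist(p, q) for q in target) for p in source)


def mutual_epsilon(a, b):
    """Smallest eps for which each set covers the other within eps.
    This symmetric reading is an assumption drawn from the abstract's wording."""
    return max(directed_epsilon(a, b), directed_epsilon(b, a))


# Toy data (made up for illustration): a "full" dataset and a candidate subset.
full = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
subset = [(0.0, 0.0), (1.0, 1.0)]

eps = mutual_epsilon(full, subset)
print(f"subset covers full within eps = {eps}")
```

On this toy data the corner points (0, 1) and (1, 0) are each at distance 1 from the nearest subset point, so the subset represents the full set only for $\varepsilon \geq 1$. A smaller value would indicate a subset that follows the full dataset more closely.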

