Ronnie Alves

h-index 21

11 papers

50 citations

Novelty 39%

AI Score 32

Ranked #123,168 of 194,257 authors (top 63%); #27,091 in LG (top 67%)

11 Papers

7.8 | LG | Oct 4, 2022 | Code

Explanation-by-Example Based on Item Response Theory

Lucas F. F. Cardoso, José de S. Ribeiro, Vitor C. A. Santos et al.

Intelligent systems that use Machine Learning classification algorithms are increasingly common in everyday society. However, many systems use black-box models that do not have characteristics that allow for self-explanation of their predictions. This situation leads researchers in the field and society to the following question: How can I trust the prediction of a model I cannot understand? In this sense, XAI emerges as a field of AI that aims to create techniques capable of explaining the decisions of the classifier to the end-user. As a result, several techniques have emerged, such as Explanation-by-Example, which has a few initiatives consolidated by the community currently working with XAI. This research explores Item Response Theory (IRT) as a tool for explaining models and measuring the reliability of the Explanation-by-Example approach. To this end, four datasets with different levels of complexity were used, and the Random Forest model was used as a hypothesis test. In the test set, 83.8% of the errors come from instances for which IRT flags the model as unreliable.

6.9 | LG | Oct 18, 2022 | Code

Explanations Based on Item Response Theory (eXirt): A Model-Specific Method to Explain Tree-Ensemble Model in Trust Perspective

José Ribeiro, Lucas Cardoso, Raíssa Silva et al.

In recent years, XAI researchers have been formalizing proposals and developing new methods to explain black box models. There is no general consensus in the community on which method to use, and the choice is almost directly linked to the popularity of a specific method. Methods such as Ciu, Dalex, Eli5, Lofo, Shap and Skater emerged with the proposal to explain black box models through global rankings of feature relevance; based on different methodologies, they generate global explanations that indicate how the model's inputs explain its predictions. In this context, 41 datasets, 4 tree-ensemble algorithms (Light Gradient Boosting, CatBoost, Random Forest, and Gradient Boosting), and 6 XAI methods were used to support the launch of a new XAI method, called eXirt, based on Item Response Theory (IRT) and aimed at tree-ensemble black box models that use tabular data for binary classification problems. In the first set of analyses, the 164 global feature relevance ranks of eXirt were compared with 984 ranks of the other XAI methods present in the literature, seeking to highlight their similarities and differences. In a second analysis, explanations exclusive to eXirt, based on Explanation-by-Example, were presented that help in understanding model trust. Thus, it was verified that eXirt is able to generate global explanations of tree-ensemble models and also local explanations of model instances through IRT, showing how this consolidated theory can be used in machine learning to obtain explainable and reliable models.

11.5 | LG | Sep 5, 2024 | Code

Standing on the shoulders of giants

Lucas Felipe Ferraro Cardoso, José de Sousa Ribeiro Filho, Vitor Cirilo Araujo Santos et al.

Although fundamental to the advancement of Machine Learning, the classic evaluation metrics extracted from the confusion matrix, such as precision and F1, are limited. Such metrics only offer a quantitative view of the models' performance, without considering the complexity of the data or the quality of the correct predictions. To overcome these limitations, recent research has introduced the use of psychometric methods such as Item Response Theory (IRT), which allows an assessment at the level of latent characteristics of instances. This work investigates how IRT concepts can enrich a confusion matrix in order to identify which model is the most appropriate among options with similar performance. In the study carried out, IRT does not replace but complements classical metrics, offering a new layer of evaluation and a view of the fine-grained behavior of models on specific instances. It was also observed, with 97% confidence, that the IRT score contributes differently from 66% of the classical metrics analyzed.

7.9 | LG | Jul 3, 2024

How Reliable and Stable are Explanations of XAI Methods?

José Ribeiro, Lucas Cardoso, Vitor Santos et al.

Black box models are increasingly used in people's daily lives. Along with this increase, Explainable Artificial Intelligence (XAI) methods have emerged, aimed at generating additional explanations of how a model makes certain predictions. In this sense, methods such as Dalex, Eli5, eXirt, Lofo and Shap emerged as different proposals and methodologies for generating explanations of black box models in an agnostic way. Along with the emergence of these methods, questions arise such as "How Reliable and Stable are XAI Methods?". To shed light on this question, this research creates a pipeline that runs experiments using the diabetes dataset and four different machine learning models (LGBM, MLP, DT and KNN), creates different levels of perturbation of the test data, and finally generates explanations from the eXirt method regarding the confidence of the models, as well as feature relevance ranks from all the XAI methods mentioned, in order to measure their stability in the face of perturbations. As a result, it was found that eXirt was able to identify the most reliable models among all those used. It was also found that current XAI methods are sensitive to perturbations, with the exception of one specific method.

1.8 | LG | Oct 19, 2022 | Code

Black Box Model Explanations and the Human Interpretability Expectations -- An Analysis in the Context of Homicide Prediction

José Ribeiro, Níkolas Carneiro, Ronnie Alves

Strategies based on Explainable Artificial Intelligence (XAI) have promoted better human interpretability of the results of black box models. This opens up the possibility of questioning whether explanations created by XAI methods meet human expectations. The XAI methods currently in use (Ciu, Dalex, Eli5, Lofo, Shap, and Skater) provide various forms of explanations, including global rankings of feature relevance, which allow for an overview of how the model is explained as a result of its inputs and outputs. These methods increase the explainability of the model and provide greater interpretability grounded in the context of the problem. Intending to shed light on the explanations generated by XAI methods and their interpretations, this research addresses a real-world, already peer-validated classification problem related to homicide prediction, replicates its proposed black box model, and uses 6 different XAI methods to generate explanations, along with 6 different human experts. The results were generated through calculations of correlations, comparative analysis and identification of relationships between all ranks of features produced. It was found that even though the model is difficult to explain, 75% of the expectations of human experts were met, with approximately 48% agreement between results from XAI methods and human experts. The results allow for answering questions such as: "Are the Expectations of Interpretation generated among different human experts similar?", "Do the different XAI methods generate similar explanations for the proposed problem?", "Can explanations generated by XAI methods meet human Expectations of Interpretation?", and "Can Explanations and Expectations of Interpretation work together?".
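
The abstract mentions correlations computed between feature ranks but does not say which coefficient was used. As a minimal sketch, assuming Spearman's rank correlation on tie-free rankings, the agreement between one XAI ranking and one expert ranking could be measured as follows. The feature names and both rankings are hypothetical and are not taken from the paper.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman correlation between two tie-free rankings of the same features.

    Each ranking is a list of feature names ordered from most to least relevant.
    """
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two features")
    if len(set(rank_a)) != n or set(rank_a) != set(rank_b) or len(rank_b) != n:
        raise ValueError("rankings must contain the same unique features")
    # Position of every feature in the second ranking.
    pos_b = {feature: i for i, feature in enumerate(rank_b)}
    # Sum of squared position differences.
    d2 = sum((i - pos_b[feature]) ** 2 for i, feature in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical rankings: one from an XAI method, one from a human expert.
xai_rank = ["age", "income", "region", "hour"]
expert_rank = ["income", "age", "region", "hour"]
print(spearman_rho(xai_rank, expert_rank))
```

A value near 1 means the two rankings order the features almost identically, and a value near -1 means they are nearly reversed.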

4.1 | LG | Apr 13, 2025

Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness

Lucas Cardoso, Vitor Santos, José Ribeiro et al.

Benchmarking is a fundamental practice in machine learning (ML) for comparing the performance of classification algorithms. However, traditional evaluation methods often overlook a critical aspect: the joint consideration of dataset complexity and an algorithm's ability to generalize. Without this dual perspective, assessments may favor models that perform well on easy instances while failing to capture their true robustness. To address this limitation, this study introduces a novel evaluation methodology that combines Item Response Theory (IRT) with the Glicko-2 rating system, originally developed to measure player strength in competitive games. IRT assesses classifier ability based on performance over difficult instances, while Glicko-2 updates performance metrics - such as rating, deviation, and volatility - via simulated tournaments between classifiers. This combined approach provides a fairer and more nuanced measure of algorithm capability. A case study using the OpenML-CC18 benchmark showed that only 15% of the datasets are truly challenging and that a reduced subset with 50% of the original datasets offers comparable evaluation power. Among the algorithms tested, Random Forest achieved the highest ability score. The results highlight the importance of improving benchmark design by focusing on dataset quality and adopting evaluation strategies that reflect both difficulty and classifier proficiency.

9.9 | LG | Jul 15, 2021

Data vs classifiers, who wins?

Lucas F. F. Cardoso, Vitor C. A. Santos, Regiane S. Kawasaki Francês et al.

Machine Learning (ML) experiments must consider two important aspects to assess the performance of a model: datasets and algorithms. Robust benchmarks are needed to evaluate the best classifiers. For this, one can adopt gold standard benchmarks available in public repositories. However, it is common not to consider the complexity of the dataset when evaluating. This work proposes a new assessment methodology based on the combination of Item Response Theory (IRT) and Glicko-2, a rating system generally adopted to assess the strength of players (e.g., chess). For each dataset in a benchmark, IRT is used to estimate the ability of classifiers, where good classifiers have good predictions for the most difficult test instances. Tournaments are then run for each pair of classifiers so that Glicko-2 updates performance information such as rating value, rating deviation and volatility for each classifier. A case study was conducted that adopted the OpenML-CC18 benchmark as the collection of datasets and a pool of various classification algorithms for evaluation. Not all datasets proved really useful for evaluating algorithms: only 10% were considered really difficult. Furthermore, a subset containing only 50% of the original OpenML-CC18 datasets was found to be equally useful for algorithm evaluation. Regarding the algorithms, the proposed methodology identified Random Forest as the algorithm with the best innate ability.
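
The link between classifier ability and instance difficulty described above can be illustrated with a logistic IRT model. The abstract does not state which IRT model was fitted, so the three-parameter logistic (3PL) form below is an assumption, and all parameter values are hypothetical.

```python
import math


def p_correct(theta, a, b, c=0.0):
    """3PL item characteristic curve (assumed model, not confirmed by the abstract).

    theta: ability of the classifier (the respondent)
    a: discrimination of the instance (the item)
    b: difficulty of the instance
    c: guessing parameter, the lower bound on the probability of a correct answer
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical values: a hard instance (b = 1.5) separates a strong
# classifier (theta = 2.0) from a weak one (theta = 0.0).
hard_item = {"a": 1.2, "b": 1.5, "c": 0.1}
print(p_correct(2.0, **hard_item))
print(p_correct(0.0, **hard_item))
```

Under this model, a classifier only earns a high estimated ability if it tends to be correct on instances with high difficulty `b`, which matches the abstract's remark that good classifiers predict the most difficult test instances well.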

2.3 | GN | Feb 2, 2021

A step toward a reinforcement learning de novo genome assembler

Kleber Padovani, Roberto Xavier, Rafael Cabral Borges et al.

De novo genome assembly is a relevant but computationally complex task in genomics. Although de novo assemblers have been used successfully in several genomics projects, there is still no 'best assembler', and the choice and setup of assemblers still rely on bioinformatics experts. Thus, as with other computationally complex problems, machine learning may emerge as an alternative (or complementary) way of developing more accurate and automated assemblers. Reinforcement learning has proven promising for solving complex activities without supervision, such as games, and there is a pressing need to understand the limits of this approach to 'real' problems, such as the DFA problem. This study aimed to shed light on the application of machine learning, using reinforcement learning (RL), in genome assembly. We expanded upon the sole previous approach found in the literature to solve this problem by carefully exploring the learning aspects of the proposed intelligent agent, which uses the Q-learning algorithm, and we provided insights for the next steps of automated genome assembly development. We improved the reward system and optimized the exploration of the state space based on pruning and in collaboration with evolutionary computing. We tested the new approaches on 23 new larger environments, which are all available on the internet. Our results suggest consistent performance progress; however, we also found limitations, especially concerning the high dimensionality of state and action spaces. Finally, we discuss paths for achieving efficient and automated genome assembly in real scenarios considering successful RL applications, including deep reinforcement learning.
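
The abstract names Q-learning as the agent's learning algorithm. The sketch below shows only the standard tabular Q-learning update. The state and action labels, the reward values, and the `alpha` and `gamma` settings are hypothetical, and the paper's improved reward system and pruning are not reproduced here.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value.

    q maps (state, action) pairs to estimated returns.
    An empty next_actions list marks a terminal state (future value 0).
    """
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical toy usage: states and actions are plain labels here; in an
# assembly setting they could stand for partial read orderings and the
# choice of the next read.
q = defaultdict(float)
q_update(q, "s0", "read_3", reward=1.0, next_state="s1",
         next_actions=["read_1", "read_2"], alpha=0.5)
print(q[("s0", "read_3")])
```

Because the table holds one entry per state-action pair, its size grows with the number of possible read orderings, which is consistent with the dimensionality limits the abstract reports.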

2.3 | LG | Aug 26, 2020

NASirt: AutoML based learning with instance-level complexity information

Habib Asseiss Neto, Ronnie C. O. Alves, Sergio V. A. Campos

Designing adequate and precise neural architectures is a challenging task, often done by highly specialized personnel. AutoML is a machine learning field that aims to generate well-performing models in an automated way. Spectral data, such as those obtained from biological analysis, generally contain a lot of important information, and these data are particularly well suited to Convolutional Neural Networks (CNN) due to their image-like shape. In this work we present NASirt, an AutoML methodology based on Neural Architecture Search (NAS) that finds high-accuracy CNN architectures for spectral datasets. The proposed methodology relies on Item Response Theory (IRT) to obtain instance-level characteristics, such as discrimination and difficulty, and it is able to define a rank of top-performing submodels. Several experiments are performed to demonstrate the methodology's performance with different spectral datasets. Accuracy results are compared to other benchmark methods, such as a high-performing, manually crafted CNN and the Auto-Keras AutoML tool. The results show that our method performs, in most cases, better than the benchmarks, achieving average accuracy as high as 97.40%.

4.1 | AI | Aug 16, 2020 | Code

Prediction of Homicides in Urban Centers: A Machine Learning Approach

José Ribeiro, Lair Meneses, Denis Costa et al.

Relevant research has been highlighted in the computing community to develop machine learning models capable of predicting the occurrence of crimes, analyzing contexts of crimes, extracting profiles of individuals linked to crime, and analyzing crimes over time. However, models capable of predicting specific crimes, such as homicide, are not commonly found in the current literature. This research presents a machine learning model to predict homicide crimes, using a dataset built from generic data (without study location dependencies) based on incident report records for 34 different types of crimes, along with time and space data from crime reports. Experimentally, data from the city of Belém, Pará, Brazil was used. These data were transformed to make the problem generic, enabling the replication of this model in other locations. In the research, analyses were performed with simple and robust algorithms on the created dataset. Statistical tests were then performed with 11 different classification methods. The results relate to predicting the occurrence and non-occurrence of homicide crimes in the month subsequent to the occurrence of other registered crimes, with 76% assertiveness for both classes of the problem, using Random Forest. The results are considered a baseline for the proposed problem.

5.0 | LG | Jul 29, 2020 | Code

Decoding machine learning benchmarks

Lucas F. F. Cardoso, Vitor C. A. Santos, Regiane S. K. Francês et al.

Despite the availability of benchmark machine learning (ML) repositories (e.g., UCI, OpenML), there is not yet a standard evaluation strategy capable of pointing out which is the best set of datasets to serve as a gold standard to test different ML algorithms. In recent studies, Item Response Theory (IRT) has emerged as a new approach to elucidate what a good ML benchmark should be. This work applied IRT to explore the well-known OpenML-CC18 benchmark and identify how suitable it is for the evaluation of classifiers. Several classifiers, ranging from classical to ensemble ones, were evaluated using IRT models, which could simultaneously estimate dataset difficulty and classifiers' ability. The Glicko-2 rating system was applied on top of IRT to summarize the innate ability and aptitude of classifiers. It was observed that not all datasets from OpenML-CC18 are really useful to evaluate classifiers. Most datasets evaluated in this work (84%) contain easy instances in general (e.g., around 10% of difficult instances only). Also, 80% of the instances in half of this benchmark are very discriminating ones, which can be of great use for pairwise algorithm comparison, but not useful to push classifiers' abilities. This paper presents this new evaluation methodology based on IRT as well as the tool decodIRT, developed to guide IRT estimation over ML benchmarks.