LG Mar 22, 2024
An In-Depth Analysis of Data Reduction Methods for Sustainable Deep Learning
Víctor Toscano-Durán, Javier Perera-Lago, Eduardo Paluzo-Hidalgo et al.
In recent years, Deep Learning has gained popularity for its ability to solve complex classification tasks, delivering increasingly better results thanks to the development of more accurate models, the availability of huge volumes of data and the improved computational capabilities of modern computers. However, these improvements in performance also bring efficiency problems, related to the storage of datasets and models, and to the energy and time wasted in both the training and inference processes. In this context, data reduction can help reduce energy consumption when training a deep learning model. In this paper, we present up to eight different methods to reduce the size of a tabular training dataset, and we develop a Python package to apply them. We also introduce a representativeness metric based on topology to measure how similar the reduced datasets are to the full training dataset. Additionally, we develop a methodology to apply these data reduction methods to image datasets for object detection tasks. Finally, we experimentally compare how these data reduction methods affect the representativeness of the reduced dataset, the energy consumption and the predictive performance of the model.
LG Apr 15, 2024
Application of the representative measure approach to assess the reliability of decision trees in dealing with unseen vehicle collision data
Javier Perera-Lago, Víctor Toscano-Durán, Eduardo Paluzo-Hidalgo et al.
Machine learning algorithms are fundamental components of novel data-informed Artificial Intelligence (AI) architectures. In this domain, representative datasets are a cornerstone in shaping the trajectory of AI development, since they are needed to train machine learning components properly. Proper training has multiple impacts: it reduces the final model's complexity, power, and uncertainties. In this paper, we investigate, from a theoretical perspective, the reliability of the $\varepsilon$-representativeness method for assessing dataset similarity for decision trees. We focus on the family of decision trees because it includes a wide variety of models known to be explainable. We provide a result guaranteeing that if two datasets are related by $\varepsilon$-representativeness, i.e., both of them have points closer than $\varepsilon$, then the predictions of the classic decision tree are similar. Experimentally, we have also found that $\varepsilon$-representativeness presents a significant correlation with the ordering of the feature importance. Moreover, we extend the results experimentally in the context of unseen vehicle collision data for XGBoost, a machine-learning component widely adopted for dealing with tabular data.
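As an illustration of the notion of $\varepsilon$-representativeness mentioned in this abstract, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes Euclidean distance, ignores class labels, and reads "related by $\varepsilon$-representativeness" as: every point of each dataset has some point of the other dataset within distance $\varepsilon$. The function names `directed_epsilon` and `are_epsilon_related` are hypothetical.

```python
import numpy as np


def directed_epsilon(source, target):
    """Smallest eps such that every point of `source` has a point of
    `target` within Euclidean distance eps (assumed reading of the abstract)."""
    # Pairwise differences with shape (n_source, n_target, n_features).
    diffs = source[:, None, :] - target[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Nearest target point for each source point, then the worst case.
    return float(dists.min(axis=1).max())


def are_epsilon_related(a, b, eps):
    """True if each dataset has points within eps of every point of the other."""
    return max(directed_epsilon(a, b), directed_epsilon(b, a)) <= eps


# Toy data (made up for illustration): a unit square and two of its corners.
full = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
reduced = np.array([[0.0, 0.0], [1.0, 1.0]])

eps_needed = directed_epsilon(full, reduced)
related_at_one = are_epsilon_related(full, reduced, 1.0)
related_at_half = are_epsilon_related(full, reduced, 0.5)
```

In this toy case the two dropped corners are each at distance 1 from the nearest retained corner, so the reduced set is related to the full set at $\varepsilon = 1$ but not at $\varepsilon = 0.5$.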
63.8 QM Mar 11
PesTwin: a biology-informed Digital Twin for enabling precision farming
Andrea De Antoni, Matteo Rucco, Alberto Maria Cattaneo et al.
In a context of growing agricultural demand and new challenges related to food security and accessibility, boosting agricultural productivity is more important than ever. Reducing the damage caused by invasive insect species is a crucial lever to achieve this objective. In support of these challenges, and in line with the principles of precision agriculture and Integrated Pest Management (IPM), an innovative simulation framework is presented, aiming to become the digital twin of a pest invasion. Through a flexible rule-based approach of the Agent-Based Modeling (ABM) paradigm, the framework supports the fine-tuning of the main ecological interactions of the pest with its crop host and the environment. Forecasting insect infestation in realistic scenarios, considering both spatial and temporal dimensions, is made possible by integrating heterogeneous data sources: pest biodata collected in the laboratory, environmental data from weather stations, and GIS data of a real crop field. In this study, an application to the global pest of soft fruit, the invasive fruit fly Drosophila suzukii, also known as Spotted Wing Drosophila (SWD), is presented.
IV Dec 17, 2019
Towards personalized diagnosis of Glioblastoma in Fluid-attenuated inversion recovery (FLAIR) by topological interpretable machine learning
Matteo Rucco, Lorenzo Falsetti, Giovanna Viticchi
Glioblastoma multiforme (GBM) is a fast-growing and highly invasive brain tumour. It tends to occur in adults between the ages of 45 and 70 and accounts for 52 percent of all primary brain tumours. GBMs are usually detected by magnetic resonance imaging (MRI). Among MRI sequences, the Fluid-attenuated inversion recovery (FLAIR) sequence produces a high-quality digital representation of the tumour. Fast detection and segmentation techniques are needed to overcome the subjectivity of medical doctors' (MDs) judgment. In the present investigation, we intend to demonstrate by means of numerical experiments that topological features combined with textural features can be employed for GBM analysis and morphological characterization on FLAIR. To this end, we performed three numerical experiments. In the first experiment, Topological Data Analysis (TDA) of a simplified 2D mathematical model of tumour growth allowed us to understand the bio-chemical conditions that facilitate tumour growth: the higher the concentration of chemical nutrients, the more virulent the process. In the second experiment, TDA was used to evaluate GBM temporal progression on FLAIR recorded within 90 days following treatment (e.g., chemo-radiation therapy - CRT) completion and at progression. The experiment confirmed that persistent entropy is a viable statistic for monitoring GBM evolution during the follow-up period. In the third experiment, we developed a novel methodology based on topological and textural features and automatic interpretable machine learning for automatic GBM classification on FLAIR. The algorithm reached a classification accuracy of up to 97%.
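For readers unfamiliar with persistent entropy, the statistic named in this abstract, the following is a minimal sketch of its standard definition; it is not the authors' code. Given a persistence barcode with finite intervals $(b_i, d_i)$, let $\ell_i = d_i - b_i$ and $L = \sum_i \ell_i$; the persistent entropy is $-\sum_i (\ell_i / L) \log(\ell_i / L)$. The function name `persistent_entropy` and the toy barcodes are assumptions made for illustration, and the sketch assumes infinite intervals have already been removed or truncated.

```python
import math


def persistent_entropy(intervals):
    """Shannon entropy of the normalised bar lengths of a persistence barcode.

    `intervals` is an iterable of finite (birth, death) pairs; bars of zero
    or negative length are skipped (an assumption of this sketch).
    """
    lengths = [death - birth for birth, death in intervals if death > birth]
    total = sum(lengths)
    if total == 0:
        # Empty barcode: define the entropy as 0 for convenience.
        return 0.0
    return -sum((l / total) * math.log(l / total) for l in lengths)


# Toy barcodes (made up): equal bars maximise entropy for a fixed bar count,
# while one dominant bar pushes the entropy towards zero.
equal_bars = persistent_entropy([(0.0, 1.0), (2.0, 3.0)])
dominant_bar = persistent_entropy([(0.0, 10.0), (0.0, 0.1)])
```

Two bars of equal length give entropy $\log 2$, the maximum for two bars; a barcode dominated by a single long bar gives a value close to zero.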
MED-PH Sep 19, 2014
Neural Hypernetwork Approach for Pulmonary Embolism diagnosis
Matteo Rucco, David M. S. Rodrigues, Emanuela Merelli et al.
This work introduces an integrative approach that combines Q-analysis with machine learning. The new approach, called Neural Hypernetwork, has been applied to a case study of pulmonary embolism (PE) diagnosis. The objective of applying the Neural Hypernetwork to PE is to improve diagnosis and thereby reduce the number of CT angiographies needed. Hypernetworks, based on topological simplicial complexes, generalize the concept of a two-body relation to many-body relations. Furthermore, Hypernetworks provide a significant generalization of network theory, enabling the integration of relational structure, logic and analytic dynamics. Another important result is that Q-analysis stays close to the data, while other approaches manipulate the data, projecting them into metric spaces or applying filtering functions to highlight the intrinsic relations. A PE is a blockage of the main artery of the lung or one of its branches, and is frequently fatal. Our study uses data on 28 diagnostic features of 1,427 people considered to be at risk of PE. The resulting neural hypernetwork correctly recognized 94% of those developing a PE. This is better than previous results obtained with other methods (statistical selection of features, partial least squares regression, topological data analysis in a metric space).