Andrea De Lorenzo

LG
7papers
113citations
Novelty39%
AI Score40

7 Papers

62.3LGMar 10
Improving TabPFN's Synthetic Data Generation by Integrating Causal Structure

Davide Tugnoli, Andrea De Lorenzo, Marco Virgolin et al.

Synthetic tabular data generation addresses data scarcity and privacy constraints in a variety of domains. Tabular Prior-Data Fitted Network (TabPFN), a recent foundation model for tabular data, has been shown capable of generating high-quality synthetic tabular data. However, TabPFN is autoregressive: features are generated sequentially by conditioning on the previous ones, depending on the order in which they appear in the input data. We demonstrate that when the feature order conflicts with causal structure, the model produces spurious correlations that impair its ability to generate synthetic data and preserve causal effects. We address this limitation by integrating causal structure into TabPFN's generation process through two complementary approaches: Directed Acyclic Graph (DAG)-aware conditioning, which samples each variable given its causal parents, and a Completed Partially Directed Acyclic Graph (CPDAG)-based strategy for scenarios with partial causal knowledge. We evaluate these approaches on controlled benchmarks and six CSuite datasets, assessing structural fidelity, distributional alignment, privacy preservation, and Average Treatment Effect (ATE) preservation. Across most settings, DAG-aware conditioning improves the quality and stability of synthetic data relative to vanilla TabPFN. The CPDAG-based strategy shows moderate improvements, with effectiveness depending on the number of oriented edges. These results indicate that injecting causal structure into autoregressive generation enhances the reliability of synthetic tabular data.

26.7LGMar 14
Close to Reality: Interpretable and Feasible Data Augmentation for Imbalanced Learning

Matheus Camilo da Silva, Gabriel Gustavo Costanzo, Andrea de Lorenzo et al.

Many machine learning classification tasks involve imbalanced datasets, which are often subject to over-sampling techniques aimed at improving model performance. However, these techniques are prone to generating unrealistic or infeasible samples. Furthermore, they often function as black boxes, lacking interpretability in their procedures. This opacity makes it difficult to track their effectiveness and provide necessary adjustments, and they may ultimately fail to yield significant performance improvements. To bridge this gap, we introduce the Decision Predicate Graphs for Data Augmentation (DPG-da), a framework that extracts interpretable decision predicates from trained models to capture domain rules and enforce them during sample generation. This design ensures that over-sampled data remain diverse, constraint-satisfying, and interpretable. In experiments on synthetic and real-world benchmark datasets, DPG-da consistently improves classification performance over traditional over-sampling methods, while guaranteeing logical validity and offering clear, interpretable explanations of the over-sampled data.

CVFeb 10, 2022
Adults as Augmentations for Children in Facial Emotion Recognition with Contrastive Learning

Marco Virgolin, Andrea De Lorenzo, Tanja Alderliesten et al.

Emotion recognition in children can help the early identification of, and intervention on, psychological complications that arise in stressful situations such as cancer treatment. Though deep learning models are increasingly being adopted, data scarcity is often an issue in pediatric medicine, including for facial emotion recognition in children. In this paper, we study the application of data augmentation-based contrastive learning to overcome data scarcity in facial emotion recognition for children. We explore the idea of ignoring generational gaps, by adding abundantly available adult data to pediatric data, to learn better representations. We investigate different ways by which adult facial expression images can be used alongside those of children. In particular, we propose to explicitly incorporate within each mini-batch adult images as augmentations for children's. Out of $84$ combinations of learning approaches and training set sizes, we find that supervised contrastive learning with the proposed training scheme performs best, reaching a test accuracy that typically surpasses the one of the second-best approach by 2% to 3%. Our results indicate that adult data can be considered to be a meaningful augmentation of pediatric data for the recognition of emotional facial expression in children, and open up the possibility for other applications of contrastive learning to improve pediatric care by complementing data of children with that of adults.

LGApr 13, 2021
Model Learning with Personalized Interpretability Estimation (ML-PIE)

Marco Virgolin, Andrea De Lorenzo, Francesca Randone et al.

High-stakes applications require AI-generated models to be interpretable. Current algorithms for the synthesis of potentially interpretable models rely on objectives or regularization terms that represent interpretability only coarsely (e.g., model size) and are not designed for a specific user. Yet, interpretability is intrinsically subjective. In this paper, we propose an approach for the synthesis of models that are tailored to the user by enabling the user to steer the model synthesis process according to her or his preferences. We use a bi-objective evolutionary algorithm to synthesize models with trade-offs between accuracy and a user-specific notion of interpretability. The latter is estimated by a neural network that is trained concurrently to the evolution using the feedback of the user, which is collected using uncertainty-based active learning. To maximize usability, the user is only asked to tell, given two models at the time, which one is less complex. With experiments on two real-world datasets involving 61 participants, we find that our approach is capable of learning estimations of interpretability that can be very different for different users. Moreover, the users tend to prefer models found using the proposed approach over models found using non-personalized interpretability indices.

LGApr 23, 2020
Learning a Formula of Interpretability to Learn Interpretable Formulas

Marco Virgolin, Andrea De Lorenzo, Eric Medvet et al.

Many risk-sensitive applications require Machine Learning (ML) models to be interpretable. Attempts to obtain interpretable models typically rely on tuning, by trial-and-error, hyper-parameters of model complexity that are only loosely related to interpretability. We show that it is instead possible to take a meta-learning approach: an ML model of non-trivial Proxies of Human Interpretability (PHIs) can be learned from human feedback, then this model can be incorporated within an ML training process to directly optimize for interpretability. We show this for evolutionary symbolic regression. We first design and distribute a survey finalized at finding a link between features of mathematical formulas and two established PHIs, simulatability and decomposability. Next, we use the resulting dataset to learn an ML model of interpretability. Lastly, we query this model to estimate the interpretability of evolving solutions within bi-objective genetic programming. We perform experiments on five synthetic and eight real-world symbolic regression problems, comparing to the traditional use of solution size minimization. The results show that the use of our model leads to formulas that are, for a same level of accuracy-interpretability trade-off, either significantly more or equally accurate. Moreover, the formulas are also arguably more interpretable. Given the very positive results, we believe that our approach represents an important stepping stone for the design of next-generation interpretable (evolutionary) ML algorithms.

ROJan 23, 2020
Design, Validation, and Case Studies of 2D-VSR-Sim, an Optimization-friendly Simulator of 2-D Voxel-based Soft Robots

Eric Medvet, Alberto Bartoli, Andrea De Lorenzo et al.

Voxel-based soft robots (VSRs) are aggregations of soft blocks whose design is amenable to optimization. We here present a software, 2D-VSR-Sim, for facilitating research concerning the optimization of VSRs body and brain. The software, written in Java, provides consistent interfaces for all the VSRs aspects suitable for optimization and considers by design the presence of sensing, i.e., the possibility of exploiting the feedback from the environment for controlling the VSR. We experimentally characterize, from a mechanical point of view, the VSRs that can be simulated with 2D-VSR-Sim and we discuss the computational burden of the simulation. Finally, we show how 2D-VSR-Sim can be used to repeat the experiments of significant previous studies and, in perspective, to provide experimental answers to a variety of research questions.

CRJun 8, 2018
(In)Secure Configuration Practices of WPA2 Enterprise Supplicants

Alberto Bartoli, Eric Medvet, Andrea De Lorenzo et al.

WPA2 Enterprise is a fundamental technology for secure communication in enterprise wireless networks. A key requirement of this technology is that WiFi-enabled devices (i.e., supplicants) be correctly configured before connecting to the enterprise wireless network. Supplicants that are not configured correctly may fall prey of attacks aimed at stealing the network credentials very easily. Such credentials have an enormous value because they usually unlock access to all enterprise services. In this work we investigate whether users and technicians are aware of these important and widespread risks. We conducted two extensive analyses: a survey among approximately 1000 users about how they configured their WiFi devices for enterprise network access; and, a review of approximately 310 network configuration guides made available by enterprise network administrators. The results provide strong indications that the key requirement of WPA2 Enterprise is violated systematically and thus can no longer be considered realistic.