LG · Mar 16, 2023 · Code
Learning with Noisy Labels through Learnable Weighting and Centroid Similarity
Farooq Ahmad Wani, Maria Sofia Bucarelli, Fabrizio Silvestri
We introduce a novel method for training machine learning models in the presence of noisy labels, which are prevalent in domains such as medical diagnosis and autonomous driving and can degrade a model's generalization performance. Inspired by established literature showing that deep learning models are prone to overfitting to noisy samples in the later epochs of training, we propose a strategy that leverages the distance to class centroids in the latent space and incorporates a discounting mechanism, aiming to diminish the influence of samples that lie far from all class centroids. By doing so, we effectively counteract the adverse effects of noisy labels. The foundational premise of our approach is the assumption that samples situated further from their respective class centroid in the initial stages of training are more likely to be associated with noise. Our methodology is grounded in robust theoretical principles and has been validated empirically through extensive experiments on several benchmark datasets. Our results show that our method consistently outperforms the existing state-of-the-art techniques, achieving significant improvements in classification accuracy in the presence of noisy labels. The code for our proposed loss function and the supplementary materials are available at https://github.com/wanifarooq/NCOD
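The centroid-based premise in this abstract can be illustrated with a small sketch. It is not the NCOD loss from the linked repository: the cosine-similarity weighting, the clipping at zero, and the helper names `class_centroids`, `cosine`, and `sample_weights` are all assumptions made here for illustration.

```python
import math

def class_centroids(features, labels):
    """Mean latent vector per (possibly noisy) label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def sample_weights(features, labels):
    """Hypothetical discount in [0, 1]: similarity to the sample's own class centroid."""
    cents = class_centroids(features, labels)
    return [max(0.0, cosine(x, cents[y])) for x, y in zip(features, labels)]

# Toy latent features; the last sample resembles class 1 but carries label 0.
features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]
labels = [0, 0, 1, 1, 0]
weights = sample_weights(features, labels)
```

In a training loop, such weights could multiply the per-sample loss so that samples far from their class centroid contribute less; whether and how the weights are learned is left open here.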
AI · Oct 11, 2023
Hypergraph Neural Networks through the Lens of Message Passing: A Common Perspective to Homophily and Architecture Design
Lev Telyatnikov, Maria Sofia Bucarelli, Guillermo Bernardez et al.
Most of the current hypergraph learning methodologies and benchmarking datasets in the hypergraph realm are obtained by lifting procedures from their graph analogs, which overshadows the specific characteristics of hypergraphs. This paper attempts to confront some pending questions in that regard: Q1 Can the concept of homophily play a crucial role in Hypergraph Neural Networks (HNNs)? Q2 Is there room for improving current HNN architectures by carefully addressing specific characteristics of higher-order networks? Q3 Do existing datasets provide a meaningful benchmark for HNNs? To address them, we first introduce a novel conceptualization of homophily in higher-order networks based on a Message Passing (MP) scheme, unifying both the analytical examination and the modeling of higher-order networks. Further, we investigate some natural, yet mostly unexplored, strategies for processing higher-order structures within HNNs, such as keeping hyperedge-dependent node representations or performing node/hyperedge stochastic sampling, leading us to the most general MP formulation to date, MultiSet, as well as to an original architecture design, MultiSetMixer. Finally, we conduct an extensive set of experiments that contextualize our proposals and successfully provide insights about our inquiries.
79.6 · LG · May 27
Compositional Generalization in Autoregressive Models via Logit Composition
Aakash Kumar, Maria Sofia Bucarelli, Emanuele Natale
Composing autoregressive models remains a core challenge in understanding how large language models can combine behaviors or skills learned across tasks. We introduce a new and principled composition strategy for autoregressive systems, inspired by composition methods developed for diffusion models. Under a factorized-conditionals assumption, we show that the resulting composition is projective: each component model preserves control over its own designated subspace of the output distribution, avoiding interference between models. This property is further preserved under smooth reparameterizations of the output space, yielding a feature-space theorem. Finally, we show that composition preserves length-generalizing behavior when the factorization assumptions and component guarantees hold uniformly at the target length. These results provide a principled understanding of when model composition and merging succeed in autoregressive systems and identify conditions under which their interactions remain stable.
LG · Sep 8, 2024
ICML Topological Deep Learning Challenge 2024: Beyond the Graph Domain
Guillermo Bernárdez, Lev Telyatnikov, Marco Montagna et al.
This paper describes the 2nd edition of the ICML Topological Deep Learning Challenge, which was hosted within the ICML 2024 ELLIS Workshop on Geometry-grounded Representation Learning and Generative Modeling (GRaM). The challenge focused on the problem of representing data in different discrete topological domains in order to bridge the gap between Topological Deep Learning (TDL) and other types of structured datasets (e.g. point clouds, graphs). Specifically, participants were asked to design and implement topological liftings, i.e. mappings between different data structures and topological domains such as hypergraphs or simplicial/cell/combinatorial complexes. The challenge received 52 submissions satisfying all the requirements. This paper introduces the main scope of the challenge and summarizes the main results and findings.
AI · Feb 6
Same Answer, Different Representations: Hidden instability in VLMs
Farooq Ahmad Wani, Alessandro Suglia, Rohit Saxena et al.
The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on the hallucination benchmarks, they can reduce false positives by making models generate more conservative answers.
LG · Oct 13, 2023
On Generalization Bounds for Projective Clustering
Maria Sofia Bucarelli, Matilde Fjeldsø Larsen, Chris Schwiegelshohn et al.
Given a set of points, clustering consists of finding a partition of the point set into $k$ clusters such that the center to which a point is assigned is as close as possible. Most commonly, centers are points themselves, which leads to the famous $k$-median and $k$-means objectives. One may also choose centers to be $j$-dimensional subspaces, which gives rise to subspace clustering. In this paper, we consider learning bounds for these problems. That is, given a set of $n$ samples $P$ drawn independently from some unknown, but fixed distribution $\mathcal{D}$, how quickly does a solution computed on $P$ converge to the optimal clustering of $\mathcal{D}$? We give several near-optimal results. In particular, for center-based objectives, we show a convergence rate of $\tilde{O}\left(\sqrt{{k}/{n}}\right)$. This matches the known optimal bounds of [Fefferman, Mitter, and Narayanan, Journal of the Mathematical Society 2016] and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means and extends it to other important objectives such as $k$-median. For subspace clustering with $j$-dimensional subspaces, we show a convergence rate of $\tilde{O}\left(\sqrt{\frac{kj^2}{n}}\right)$. These are the first provable bounds for most of these problems. For the specific case of projective clustering, which generalizes $k$-means, we show that a convergence rate of $\Omega\left(\sqrt{\frac{kj}{n}}\right)$ is necessary, thereby proving that the bounds from [Fefferman, Mitter, and Narayanan, Journal of the Mathematical Society 2016] are essentially optimal.
15.1 · CL · Mar 23
Select, Label, Evaluate: Active Testing in NLP
Antonio Purificato, Maria Sofia Bucarelli, Andrea Bacciu et al.
Human annotation cost and time remain significant bottlenecks in Natural Language Processing (NLP), with test data annotation being particularly expensive due to the stringent requirement for low-error and high-quality labels necessary for reliable model evaluation. Traditional approaches require annotating entire test sets, leading to substantial resource requirements. Active Testing is a framework that selects the most informative test samples for annotation. Given a labeling budget, it aims to choose the subset that best estimates model performance while minimizing cost and human effort. In this work, we formalize Active Testing in NLP and we conduct an extensive benchmarking of existing approaches across 18 datasets and 4 embedding strategies spanning 4 different NLP tasks. The experiments show annotation reductions of up to 95%, with performance estimation accuracy difference from the full test set within 1%. Our analysis reveals variations in method effectiveness across different data characteristics and task types, with no single approach emerging as universally superior. Lastly, to address the limitation of requiring a predefined annotation budget in existing sample selection strategies, we introduce an adaptive stopping criterion that automatically determines the optimal number of samples.
LG · Feb 22, 2024 · Code
Link Prediction with Physics-Inspired Graph Neural Networks
Andrea Giuseppe Di Francesco, Francesco Caso, Maria Sofia Bucarelli et al.
The message-passing mechanism underlying Graph Neural Networks (GNNs) is not naturally suited for heterophilic datasets, where adjacent nodes often have different labels. Most solutions to this problem remain confined to the task of node classification. In this article, we focus on the valuable task of link prediction under heterophily, an interesting problem for recommendation systems, social network analysis, and other applications. GNNs like GRAFF have improved node classification under heterophily by incorporating physics biases in the architecture. Similarly, we propose GRAFF-LP, an extension of GRAFF for link prediction. We show that GRAFF-LP effectively discriminates existing from non-existing edges by learning implicitly to separate the edge gradients. Based on this information, we propose a new readout function inspired by physics. Remarkably, this new function not only enhances the performance of GRAFF-LP but also improves that of other baseline models, leading us to reconsider how every link prediction experiment has been conducted so far. Finally, we provide evidence that even simple GNNs did not experience greater difficulty in predicting heterophilic links compared to homophilic ones. This leads us to believe in the necessity for heterophily measures specifically tailored for link prediction, distinct from those used in node classification. The code and appendix are available at https://github.com/difra100/Link_Prediction_with_PIGNN_IJCNN.
LG · Nov 26, 2024
Task Singular Vectors: Reducing Task Interference in Model Merging
Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli et al.
Task Arithmetic has emerged as a simple yet effective method to merge models without additional training. However, by treating entire networks as flat parameter vectors, it overlooks key structural information and is susceptible to task interference. In this paper, we study task vectors at the layer level, focusing on task layer matrices and their singular value decomposition. In particular, we concentrate on the resulting singular vectors, which we refer to as Task Singular Vectors (TSV). Recognizing that layer task matrices are often low-rank, we propose TSV-Compress (TSV-C), a simple procedure that compresses them to 10% of their original size while retaining 99% of accuracy. We further leverage this low-rank space to define a new measure of task interference based on the interaction of singular vectors from different tasks. Building on these findings, we introduce TSV-Merge (TSV-M), a novel model merging approach that combines compression with interference reduction, significantly outperforming existing methods.
LG · Mar 21, 2024
$\nabla \tau$: Gradient-based and Task-Agnostic machine Unlearning
Daniel Trippa, Cesare Campagnano, Maria Sofia Bucarelli et al.
Machine Unlearning, the process of selectively eliminating the influence of certain data examples used during a model's training, has gained significant attention as a means for practitioners to comply with recent data protection regulations. However, existing unlearning methods face critical drawbacks, including their prohibitively high cost, often associated with a large number of hyperparameters, and the limitation of forgetting only relatively small data portions. This often makes retraining the model from scratch a quicker and more effective solution. In this study, we introduce Gradient-based and Task-Agnostic machine Unlearning ($\nabla \tau$), an optimization framework designed to remove the influence of a subset of training data efficiently. It applies adaptive gradient ascent to the data to be forgotten while using standard gradient descent for the remaining data. $\nabla \tau$ offers multiple benefits over existing approaches. It enables the unlearning of large sections of the training dataset (up to 30%). It is versatile, supporting various unlearning tasks (such as subset forgetting or class removal) and applicable across different domains (images, text, etc.). Importantly, $\nabla \tau$ requires no hyperparameter adjustments, making it a more appealing option than retraining the model from scratch. We evaluate our framework's effectiveness using a set of well-established Membership Inference Attack metrics, demonstrating up to 10% enhancements in performance compared to state-of-the-art methods without compromising the original model's accuracy.
LG · Nov 5, 2024
ATM: Improving Model Merging by Alternating Tuning and Merging
Luca Zhou, Daniele Solombrino, Donato Crisostomi et al.
Model merging has emerged as a cost-efficient approximation to multitask learning. Among merging strategies, task arithmetic is notable for its simplicity and effectiveness. In this work, we provide a theoretical motivation for task vectors by highlighting that, under single-epoch full-batch gradient descent, they are equivalent to multitask gradients. This insight leads us to reinterpret model merging as a single step in an iterative procedure that Alternates between Tuning and Merging (ATM). We propose two applications of ATM: (1) as an alternative to multitask learning in scenarios where data sharing is restricted (e.g., federated settings), and (2) as a lightweight refinement step to improve existing model merging methods using a small validation set. Experiments across diverse vision tasks demonstrate the effectiveness of ATM.
LG · Jan 8, 2024
A topological description of loss surfaces based on Betti Numbers
Maria Sofia Bucarelli, Giuseppe Alessio D'Inverno, Monica Bianchini et al.
In the context of deep learning models, attention has recently been paid to studying the surface of the loss function in order to better understand training with methods based on gradient descent. This search for an appropriate description, both analytical and topological, has led to numerous efforts to identify spurious minima and characterize gradient dynamics. Our work aims to contribute to this field by providing a topological measure to evaluate loss complexity in the case of multilayer neural networks. We compare deep and shallow architectures with common sigmoidal activation functions by deriving upper and lower bounds on the complexity of their loss function and revealing how that complexity is influenced by the number of hidden units, training models, and the activation function used. Additionally, we found that certain variations in the loss function or model architecture, such as adding an $\ell_2$ regularization term or implementing skip connections in a feedforward network, do not affect loss topology in specific cases.
ML · Feb 18, 2025
The Majority Vote Paradigm Shift: When Popular Meets Optimal
Antonio Purificato, Maria Sofia Bucarelli, Anil Kumar Nelakanti et al.
Reliably labelling data typically requires annotations from multiple human workers. However, humans are far from being perfect. Hence, it is a common practice to aggregate labels gathered from multiple annotators to make a more confident estimate of the true label. Among many aggregation methods, the simple and well known Majority Vote (MV) selects the class label polling the highest number of votes. However, despite its importance, the optimality of MV's label aggregation has not been extensively studied. We address this gap in our work by characterising the conditions under which MV achieves the theoretically optimal lower bound on label estimation error. Our results capture the tolerable limits on annotation noise under which MV can optimally recover labels for a given class distribution. This certificate of optimality provides a more principled approach to model selection for label aggregation as an alternative to otherwise inefficient practices that sometimes include higher experts, gold labels, etc., that are all marred by the same human uncertainty despite huge time and monetary costs. Experiments on both synthetic and real world data corroborate our theoretical findings.
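For readers unfamiliar with the aggregation rule studied in this abstract, a minimal sketch of Majority Vote follows. The tie-breaking rule (smallest label wins) and the toy annotations are assumptions for illustration; the abstract does not specify how ties are resolved.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label with the most votes.

    Ties are broken by taking the smallest label, which is an
    assumption of this sketch rather than part of the paper.
    """
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)

# Each inner list holds the labels given by three annotators to one item.
annotations = [
    ["cat", "cat", "dog"],
    ["dog", "dog", "dog"],
    ["cat", "dog", "bird"],  # three-way tie
]
aggregated = [majority_vote(v) for v in annotations]
```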
LG · Aug 22, 2025
On Task Vectors and Gradients
Luca Zhou, Daniele Solombrino, Donato Crisostomi et al.
Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
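The single-epoch equivalence stated in this abstract can be checked numerically on a toy problem. The sketch below assumes a linear model with a mean-squared-error loss and full-batch gradient descent, so that one step equals one epoch; the data, the learning rate, and the helper names `grad` and `finetune` are illustrative choices, not taken from the paper.

```python
def grad(w, data):
    """Gradient of 0.5 * mean((w . x - y)^2) with respect to w."""
    g = [0.0] * len(w)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            g[i] += err * xi / len(data)
    return g

def finetune(w0, data, lr, steps):
    """Full-batch gradient descent; each step is one epoch."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w, data)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.5)]
w0 = [0.0, 0.0]  # stand-in for the pretrained weights
lr = 0.1

# One epoch: the task vector is exactly -lr * gradient at w0.
g0 = grad(w0, data)
w1 = finetune(w0, data, lr, steps=1)
task_vector = [a - b for a, b in zip(w1, w0)]
scaled_neg_grad = [-lr * g for g in g0]

# Three epochs: the first-order prediction -3 * lr * g0 is only approximate.
w3 = finetune(w0, data, lr, steps=3)
tau3 = [a - b for a, b in zip(w3, w0)]
first_order = [-3 * lr * g for g in g0]
gap = max(abs(a - b) for a, b in zip(tau3, first_order))
```

For this quadratic loss the multi-epoch gap is dominated by a term proportional to `lr ** 2`, which matches the second-order error described in the abstract for the multi-epoch setting.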
LG · Dec 11, 2024
Robustness of Graph Classification: failure modes, causes, and noise-resistant loss in Graph Neural Networks
Farooq Ahmad Wani, Maria Sofia Bucarelli, Andrea Giuseppe Di Francesco et al.
Graph Neural Networks (GNNs) are powerful at solving graph classification tasks, yet applied problems often contain noisy labels. In this work, we study GNN robustness to label noise and identify failure modes in which models struggle to generalise: on low-order graphs, with low label coverage, or when the model is over-parameterized. We establish both empirical and theoretical links between GNN robustness and the reduction of the total Dirichlet Energy of learned node representations, which encapsulates the hypothesized GNN smoothness inductive bias. Finally, we introduce two training strategies to enhance GNN robustness: (1) incorporating a novel inductive bias in the weight matrices through the removal of negative eigenvalues, connected to Dirichlet Energy minimization; (2) extending to GNNs a loss penalty that promotes learned smoothness. Importantly, neither approach negatively impacts performance in noise-free settings, supporting our hypothesis that the source of GNN robustness is their smoothness inductive bias.
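As a concrete reference for the quantity mentioned in this abstract, the sketch below computes a Dirichlet energy on a small graph. It uses the unnormalized form, the sum over undirected edges of the squared difference between endpoint features; the paper may use a degree-normalized variant, so this convention is an assumption.

```python
def dirichlet_energy(edges, features):
    """Sum over undirected edges (i, j) of ||x_i - x_j||^2 (unnormalized convention)."""
    total = 0.0
    for i, j in edges:
        total += sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
    return total

edges = [(0, 1), (1, 2), (2, 3)]  # path graph on four nodes
smooth = [[1.0], [1.0], [1.0], [1.0]]    # identical neighbours: zero energy
rough = [[1.0], [-1.0], [1.0], [-1.0]]   # alternating signs: high energy

smooth_energy = dirichlet_energy(edges, smooth)
rough_energy = dirichlet_energy(edges, rough)
```

Lower energy means neighbouring nodes carry similar representations, which is the smoothness notion the abstract links to robustness.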
LG · Nov 27, 2025
PISA: Prioritized Invariant Subgraph Aggregation
Ali Ghasemi, Farooq Ahmad Wani, Maria Sofia Bucarelli et al.
Recent work has extended the invariance principle for out-of-distribution (OOD) generalization from Euclidean to graph data, where challenges arise due to complex structures and diverse distribution shifts in node attributes and topology. To handle these, Chen et al. (2022b) proposed CIGA, which uses causal modeling and an information-theoretic objective to extract a single invariant subgraph capturing causal features. However, this single-subgraph focus can miss multiple causal patterns. Liu et al. (2025) addressed this with SuGAr, which learns and aggregates diverse invariant subgraphs via a sampler and diversity regularizer, improving robustness but still relying on simple uniform or greedy aggregation. To overcome this, the proposed PISA framework introduces a dynamic MLP-based aggregation that prioritizes and combines subgraph representations more effectively. Experiments on 15 datasets, including DrugOOD (Ji et al., 2023), show that PISA achieves up to 5% higher classification accuracy than prior methods.
LG · Nov 24, 2025
Subtract the Corruption: Training-Data-Free Corrective Machine Unlearning using Task Arithmetic
Mostafa Mozafari, Farooq Ahmad Wani, Maria Sofia Bucarelli et al.
Corrupted training data are ubiquitous. Corrective Machine Unlearning (CMU) seeks to remove the influence of such corruption post-training. Prior CMU typically assumes access to identified corrupted training samples (a "forget set"). However, in many real-world scenarios the training data are no longer accessible. We formalize source-free CMU, where the original training data are unavailable and, consequently, no forget set of identified corrupted training samples can be specified. Instead, we assume a small proxy (surrogate) set of corrupted samples that reflect the suspected corruption type without needing to be the original training samples. In this stricter setting, methods relying on a forget set are ineffective or narrow in scope. We introduce Corrective Unlearning in Task Space (CUTS), a lightweight weight-space correction method guided by the proxy set using task arithmetic principles. CUTS treats the clean and the corruption signal as distinct tasks. Specifically, we briefly fine-tune the corrupted model on the proxy to amplify the corruption mechanism in the weight space, compute the difference between the corrupted and fine-tuned weights as a proxy task vector, and subtract a calibrated multiple of this vector to cancel the corruption. Without access to clean data or a forget set, CUTS recovers a large fraction of the lost utility under label noise and, for backdoor triggers, nearly eliminates the attack with minimal damage to utility, outperforming state-of-the-art specialized CMU methods in the source-free setting.
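The weight-space arithmetic described in this abstract can be sketched on toy parameters. The sign convention (fine-tuned minus corrupted), the fixed scaling factor `alpha`, the parameter name `layer.weight`, and the helper names are assumptions of this sketch; the abstract does not spell out the calibration procedure, and the fine-tuning step itself is replaced here by hand-written numbers.

```python
def proxy_task_vector(corrupted, finetuned):
    """Per-parameter difference: fine-tuned-on-proxy weights minus corrupted weights."""
    return {k: [f - c for f, c in zip(finetuned[k], corrupted[k])] for k in corrupted}

def subtract_corruption(corrupted, task_vector, alpha):
    """Move the corrupted weights against the proxy task vector, scaled by alpha."""
    return {
        k: [c - alpha * t for c, t in zip(corrupted[k], task_vector[k])]
        for k in corrupted
    }

# Toy weights standing in for a model's state; values are made up.
corrupted = {"layer.weight": [0.5, -0.2, 1.0]}
finetuned = {"layer.weight": [0.7, -0.2, 1.3]}  # after brief proxy fine-tuning

tau = proxy_task_vector(corrupted, finetuned)
corrected = subtract_corruption(corrupted, tau, alpha=1.0)
```

In practice `alpha` would need to be chosen by some calibration rule; the value 1.0 is used only to keep the arithmetic easy to follow.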
LG · May 23, 2025
Early-Exit Graph Neural Networks
Andrea Giuseppe Di Francesco, Maria Sofia Bucarelli, Franco Maria Nardini et al.
Early-exit mechanisms allow deep neural networks to halt inference as soon as classification confidence is high enough, adaptively trading depth for confidence, and thereby cutting latency and energy on easy inputs while retaining full-depth accuracy for harder ones. Similarly, adding early-exit mechanisms to Graph Neural Networks (GNNs), the go-to models for graph-structured data, allows dynamically trading depth for confidence on simple graphs while maintaining full-depth accuracy on harder and more complex graphs to capture intricate relationships. Although early exits have proven effective across various deep learning domains, their potential within GNNs in scenarios that require deep architectures while resisting over-smoothing and over-squashing remains largely unexplored. We unlock that potential by first introducing Symmetric-Anti-Symmetric Graph Neural Networks (SAS-GNN), whose symmetry-based inductive biases mitigate these issues and yield stable intermediate representations that can support early exiting in GNNs. Building on this backbone, we present Early-Exit Graph Neural Networks (EEGNNs), which append confidence-aware exit heads that allow on-the-fly termination of propagation based on each node or the entire graph. Experiments show that EEGNNs preserve robust performance as depth grows and deliver competitive accuracy on heterophilic and long-range benchmarks, matching attention-based and asynchronous message-passing models while substantially reducing computation and latency. We plan to release the code to reproduce our experiments.
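The general early-exit idea in the first sentence of this abstract can be sketched as follows. This is a generic confidence-threshold rule, not the EEGNN exit heads: the softmax-maximum confidence, the fixed threshold, and the precomputed per-layer logits are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_predict(per_layer_logits, threshold):
    """Return (predicted class, layers used).

    Stops at the first layer whose top softmax probability reaches the
    threshold; otherwise falls back to the last layer. Assumes a
    non-empty list of per-layer logits.
    """
    prediction, depth = None, 0
    for depth, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        confidence = max(probs)
        prediction = probs.index(confidence)
        if confidence >= threshold:
            break
    return prediction, depth

# Hypothetical logits produced by exit heads after layers 1, 2, and 3.
easy = [[4.0, 0.0, 0.0], [5.0, 0.0, 0.0], [6.0, 0.0, 0.0]]
hard = [[0.2, 0.1, 0.0], [0.5, 0.9, 0.0], [0.0, 3.5, 0.0]]

easy_result = early_exit_predict(easy, threshold=0.9)
hard_result = early_exit_predict(hard, threshold=0.9)
```

The easy input exits after the first layer, while the hard input only becomes confident at the last one, which is the depth-for-confidence trade the abstract describes.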
LG · Apr 6, 2025
MASS: MoErging through Adaptive Subspace Selection
Donato Crisostomi, Alessandro Zirilli, Antonio Andrea Gargiulo et al.
Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet, existing merging methods fall short of matching the full accuracy of separately fine-tuned endpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input's intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a ~2× storage factor compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 for benchmarks of 8, 14 and 20 tasks respectively, establishing a new state-of-the-art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.
NE · Oct 5, 2021
NEWRON: A New Generalization of the Artificial Neuron to Enhance the Interpretability of Neural Networks
Federico Siciliano, Maria Sofia Bucarelli, Gabriele Tolomei et al.
In this work, we formulate NEWRON: a generalization of the McCulloch-Pitts neuron structure. This new framework aims to explore additional desirable properties of artificial neurons. We show that some specializations of NEWRON allow the network to be interpretable with no change in its expressiveness. By just inspecting the models produced by our NEWRON-based networks, we can understand the rules governing the task. Extensive experiments show that the quality of the generated models is better than that of traditional interpretable models and in line with or better than that of standard neural networks.