LGMay 19, 2022
A Boosting Algorithm for Positive-Unlabeled LearningYawen Zhao, Mingzhe Zhang, Chenhao Zhang et al.
Positive-unlabeled (PU) learning deals with binary classification problems when only positive (P) and unlabeled (U) data are available. Many recent PU methods are based on neural networks, but little has been done to develop boosting algorithms for PU learning, despite boosting algorithms' strong performance on many fully supervised classification problems. In this paper, we propose a novel boosting algorithm, AdaPU, for PU learning. Similarly to AdaBoost, AdaPU aims to optimize an empirical exponential loss, but the loss is based on the PU data, rather than on positive-negative (PN) data. As in AdaBoost, we learn a weighted combination of weak classifiers by learning one weak classifier and its weight at a time. However, AdaPU requires a very different algorithm for learning the weak classifiers and determining their weights. This is because AdaPU learns a weak classifier and its weight using a weighted positive-negative (PN) dataset with some negative data weights $-$ the dataset is derived from the original PU data, and the data weights are determined by the current weighted classifier combination, but some data weights are negative. Our experiments showed that AdaPU outperforms neural networks on several benchmark PU datasets, including a large-scale challenging cyber security dataset.
LGMar 31, 2024
Label-Agnostic Forgetting: A Supervision-Free Unlearning in Deep ModelsShaofei Shen, Chenhao Zhang, Yawen Zhao et al.
Machine unlearning aims to remove information derived from forgotten data while preserving that of the remaining dataset in a well-trained model. With the increasing emphasis on data privacy, several approaches to machine unlearning have emerged. However, these methods typically rely on complete supervision throughout the unlearning process. Unfortunately, obtaining such supervision, whether for the forgetting or remaining data, can be impractical due to the substantial cost associated with annotating real-world datasets. This challenge prompts us to propose a supervision-free unlearning approach that operates without the need for labels during the unlearning process. Specifically, we introduce a variational approach to approximate the distribution of representations for the remaining data. Leveraging this approximation, we adapt the original model to eliminate information from the forgotten data at the representation level. To further address the issue of lacking supervision information, which hinders alignment with ground truth, we introduce a contrastive loss to facilitate the matching of representations between the remaining data and those of the original model, thus preserving predictive performance. Experimental results across various unlearning tasks demonstrate the effectiveness of our proposed method, Label-Agnostic Forgetting (LAF) without using any labels, which achieves comparable performance to state-of-the-art methods that rely on full supervision information. Furthermore, our approach excels in semi-supervised scenarios, leveraging limited supervision information to outperform fully supervised baselines. This work not only showcases the viability of supervision-free unlearning in deep models but also opens up a new possibility for future research in unlearning at the representation level.
LGJul 21, 2025
Machine Unlearning for Streaming ForgettingShaofei Shen, Chenhao Zhang, Yawen Zhao et al.
Machine unlearning aims to remove knowledge of the specific training data in a well-trained model. Currently, machine unlearning methods typically handle all forgetting data in a single batch, removing the corresponding knowledge all at once upon request. However, in practical scenarios, requests for data removal often arise in a streaming manner rather than in a single batch, leading to reduced efficiency and effectiveness in existing methods. Such challenges of streaming forgetting have not been the focus of much research. In this paper, to address the challenges of performance maintenance, efficiency, and data access brought about by streaming unlearning requests, we introduce a streaming unlearning paradigm, formalizing the unlearning as a distribution shift problem. We then estimate the altered distribution and propose a novel streaming unlearning algorithm to achieve efficient streaming forgetting without requiring access to the original training data. Theoretical analyses confirm an $O(\sqrt{T} + V_T)$ error bound on the streaming unlearning regret, where $V_T$ represents the cumulative total variation in the optimal solution over $T$ learning rounds. This theoretical guarantee is achieved under mild conditions without the strong restriction of convex loss function. Experiments across various models and datasets validate the performance of our proposed method.
LGJun 12, 2024
GENIU: A Restricted Data Access Unlearning for Imbalanced DataChenhao Zhang, Shaofei Shen, Yawen Zhao et al.
With the increasing emphasis on data privacy, the significance of machine unlearning has grown substantially. Class unlearning, which involves enabling a trained model to forget data belonging to a specific class learned before, is important as classification tasks account for the majority of today's machine learning as a service (MLaaS). Retraining the model on the original data, excluding the data to be forgotten (a.k.a forgetting data), is a common approach to class unlearning. However, the availability of original data during the unlearning phase is not always guaranteed, leading to the exploration of class unlearning with restricted data access. While current unlearning methods with restricted data access usually generate proxy sample via the trained neural network classifier, they typically focus on training and forgetting balanced data. However, the imbalanced original data can cause trouble for these proxies and unlearning, particularly when the forgetting data consists predominantly of the majority class. To address this issue, we propose the GENerative Imbalanced Unlearning (GENIU) framework. GENIU utilizes a Variational Autoencoder (VAE) to concurrently train a proxy generator alongside the original model. These generated proxies accurately represent each class and are leveraged in the unlearning phase, eliminating the reliance on the original training data. To further mitigate the performance degradation resulting from forgetting the majority class, we introduce an in-batch tuning strategy that works with the generated proxies. GENIU is the first practical framework for class unlearning in imbalanced data settings and restricted data access, ensuring the preservation of essential information for future unlearning. Experimental results confirm the superiority of GENIU over existing methods, establishing its effectiveness in empirical scenarios.