On the limitation of evaluating machine unlearning using only a single training seed
This paper addresses a methodological issue for machine learning researchers and practitioners who rely on empirical evaluations of approximate unlearning algorithms.
The paper demonstrates that evaluating machine unlearning algorithms using only a single training seed can produce non-representative results due to high sensitivity to random seeds, and recommends incorporating variability across multiple training seeds for more reliable empirical comparisons.
Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate, and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently, starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because, even for the same architecture and the same dataset, some MU methods can be highly sensitive to the choice of random seed used for model training. We therefore recommend that empirical comparisons of MU algorithms should also reflect the variability across different model training seeds.
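As an illustration of the recommendation above, the following minimal sketch separates the spread of an evaluation metric across repeated unlearning runs from a fixed trained model (within a training seed) from the spread across different training seeds. It is not the paper's evaluation code: the function name `seed_variability`, the input layout (a dictionary mapping each training seed to a list of metric values, one per unlearning run), and the choice of standard deviation as the summary statistic are all assumptions made for illustration.

```python
from statistics import mean, stdev


def seed_variability(scores):
    """Summarise metric variability within and across training seeds.

    `scores` is a hypothetical layout: {training_seed: [metric per unlearning run]}.
    Requires at least two training seeds and at least two runs per seed.
    """
    # Mean metric for each trained model (i.e. each training seed).
    per_seed_means = {seed: mean(values) for seed, values in scores.items()}

    # Within-seed spread: what repeated unlearning runs from ONE model reveal,
    # averaged over the trained models.
    within_seed_std = mean([stdev(values) for values in scores.values()])

    # Between-seed spread: how much the per-model means move when only the
    # training seed changes. A single-seed evaluation cannot observe this.
    between_seed_std = stdev(list(per_seed_means.values()))

    return per_seed_means, within_seed_std, between_seed_std
```

If the between-seed spread is comparable to or larger than the within-seed spread, results reported from a single trained model are unlikely to be representative, which is the situation the abstract warns about.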