Privately Customizing Prefinetuning to Better Match User Data in Federated Learning
This work addresses the challenge of matching prefinetuning datasets to private user data in federated learning and offers a proof-of-concept for dataset customization. The contribution is incremental in that it builds on existing FL and privacy techniques.
The paper tackles the problem of evaluating prefinetuning dataset quality in federated learning. It proposes FreD, a method that privately computes the Fréchet distance between a central dataset and federated client data; empirically, this distance accurately predicts the best prefinetuning dataset at minimal privacy cost.
In Federated Learning (FL), accessing private client data incurs communication and privacy costs. As a result, FL deployments commonly prefinetune pretrained foundation models on a (large, possibly public) dataset that is held by the central server; they then FL-finetune the model on a private, federated dataset held by clients. Evaluating prefinetuning dataset quality reliably and privately is therefore of high importance. To this end, we propose FreD (Federated Private Fréchet Distance), a privately computed distance between a prefinetuning dataset and federated datasets. Intuitively, it privately computes a Fréchet distance between the embeddings that a large language model generates on the central (public) dataset and those it generates on the federated private client data. To make this computation privacy-preserving, we use distributed, differentially-private mean and covariance estimators. We show empirically that FreD accurately predicts the best prefinetuning dataset at minimal privacy cost. Altogether, using FreD we demonstrate a proof-of-concept for a new approach in private FL training: (1) customize a prefinetuning dataset to better match user data, (2) prefinetune, and (3) perform FL-finetuning.
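
To make the idea concrete, the following is a minimal sketch of a noisy Fréchet distance between two sets of embeddings, not the paper's implementation. It assumes embeddings are already computed, that each embedding is clipped to a fixed L2 norm, that the total number of examples is public, and that Gaussian noise is added to the aggregated first and second moments. The function names (`clip_rows`, `noisy_moments`, `frechet_distance`) and the parameters `max_norm` and `noise_std` are hypothetical; in particular, `noise_std` is a placeholder and is not calibrated to any specific privacy budget. The distance returned is the squared Fréchet distance between two Gaussians fitted to the embeddings.

```python
import numpy as np


def clip_rows(x, max_norm):
    """Scale each row so that its L2 norm is at most max_norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x * np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))


def sqrt_psd(m):
    """Symmetric square root of a PSD matrix; negative eigenvalues are clipped to 0."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T


def noisy_moments(client_embeddings, max_norm, noise_std, rng):
    """Estimate mean and covariance from per-client embedding matrices.

    Assumption: noise_std is a placeholder, not derived from an
    (epsilon, delta) target; the example count n is treated as public.
    """
    dim = client_embeddings[0].shape[1]
    sum_x = np.zeros(dim)
    sum_xxt = np.zeros((dim, dim))
    n = 0
    for emb in client_embeddings:
        emb = clip_rows(emb, max_norm)
        sum_x += emb.sum(axis=0)   # first-moment contribution
        sum_xxt += emb.T @ emb     # second-moment contribution
        n += emb.shape[0]
    # Add Gaussian noise to the aggregates (symmetrized for the matrix).
    sum_x += rng.normal(0.0, noise_std, size=dim)
    noise = rng.normal(0.0, noise_std, size=(dim, dim))
    sum_xxt += (noise + noise.T) / 2.0
    mean = sum_x / n
    cov = sum_xxt / n - np.outer(mean, mean)
    # Noise can break positive semidefiniteness; project back onto PSD matrices.
    root = sqrt_psd((cov + cov.T) / 2.0)
    return mean, root @ root


def frechet_distance(mu1, cov1, mu2, cov2):
    """Squared Frechet distance between N(mu1, cov1) and N(mu2, cov2)."""
    s1 = sqrt_psd(cov1)
    inner = s1 @ cov2 @ s1
    # Tr((cov1 cov2)^{1/2}) equals the sum of square roots of the
    # eigenvalues of the symmetric matrix s1 cov2 s1.
    eigvals = np.linalg.eigvalsh((inner + inner.T) / 2.0)
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = np.sum((mu1 - mu2) ** 2)
    return float(mean_term + np.trace(cov1) + np.trace(cov2) - 2.0 * trace_sqrt)
```

Under these assumptions, a server could compute the mean and covariance of each candidate prefinetuning dataset without noise, obtain the noisy client-side moments once via `noisy_moments`, and rank candidates by `frechet_distance`, choosing the one with the smallest value.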