Charlie Hou

LG
h-index7
8papers
140citations
Novelty56%
AI Score38

8 Papers

LGFeb 17, 2023
Privately Customizing Prefinetuning to Better Match User Data in Federated Learning

Charlie Hou, Hongyuan Zhan, Akshat Shrivastava et al. · meta-ai

In Federated Learning (FL), accessing private client data incurs communication and privacy costs. As a result, FL deployments commonly prefinetune pretrained foundation models on a (large, possibly public) dataset that is held by the central server; they then FL-finetune the model on a private, federated dataset held by clients. Evaluating prefinetuning dataset quality reliably and privately is therefore of high importance. To this end, we propose FreD (Federated Private Fréchet Distance) -- a privately computed distance between a prefinetuning dataset and federated datasets. Intuitively, it privately computes and compares a Fréchet distance between embeddings generated by a large language model on both the central (public) dataset and the federated private client data. To make this computation privacy-preserving, we use distributed, differentially-private mean and covariance estimators. We show empirically that FreD accurately predicts the best prefinetuning dataset at minimal privacy cost. Altogether, using FreD we demonstrate a proof-of-concept for a new approach in private FL training: (1) customize a prefinetuning dataset to better match user data (2) prefinetune (3) perform FL-finetuning.

LGJul 31, 2023
Pretrained deep models outperform GBDTs in Learning-To-Rank under label scarcity

Charlie Hou, Kiran Koshy Thekumparampil, Michael Shavlovsky et al.

On tabular data, a significant body of literature has shown that current deep learning (DL) models perform at best similarly to Gradient Boosted Decision Trees (GBDTs), while significantly underperforming them on outlier data. However, these works often study idealized problem settings which may fail to capture complexities of real-world scenarios. We identify a natural tabular data setting where DL models can outperform GBDTs: tabular Learning-to-Rank (LTR) under label scarcity. Tabular LTR applications, including search and recommendation, often have an abundance of unlabeled data, and scarce labeled data. We show that DL rankers can utilize unsupervised pretraining to exploit this unlabeled data. In extensive experiments over both public and proprietary datasets, we show that pretrained DL rankers consistently outperform GBDT rankers on ranking metrics -- sometimes by as much as 38% -- both overall and on outliers.

LGApr 23, 2025Code
POPri: Private Federated Learning using Preference-Optimized Synthetic Data

Charlie Hou, Mei-Yu Wang, Yige Zhu et al.

In practical settings, differentially private Federated learning (DP-FL) is the dominant method for training models from private, on-device client data. Recent work has suggested that DP-FL may be enhanced or outperformed by methods that use DP synthetic data (Wu et al., 2024; Hou et al., 2024). The primary algorithms for generating DP synthetic data for FL applications require careful prompt engineering based on public information and/or iterative private client feedback. Our key insight is that the private client feedback collected by prior DP synthetic data methods (Hou et al., 2024; Xie et al., 2024) can be viewed as an RL (reinforcement learning) reward. Our algorithm, Policy Optimization for Private Data (POPri) harnesses client feedback using policy optimization algorithms such as Direct Preference Optimization (DPO) to fine-tune LLMs to generate high-quality DP synthetic data. To evaluate POPri, we release LargeFedBench, a new federated text benchmark for uncontaminated LLM evaluations on federated client data. POPri substantially improves the utility of DP synthetic data relative to prior work on LargeFedBench datasets and an existing benchmark from Xie et al. (2024). POPri closes the gap between next-token prediction accuracy in the fully-private and non-private settings by up to 58%, compared to 28% for prior synthetic data methods, and 3% for state-of-the-art DP federated learning methods. The code and data are available at https://github.com/meiyuw/POPri.

LGJun 5, 2024Code
PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

Charlie Hou, Akshat Shrivastava, Hongyuan Zhan et al.

On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes ($ε=1.29$, $ε=7.58$). We achieve these results while using 9$\times$ fewer rounds, 6$\times$ less client computation per round, and 100$\times$ less communication per round. Second, finetuning large models on PrE-Text's DP synthetic data improves large language model (LLM) performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at https://github.com/houcharlie/PrE-Text.

LGFeb 29, 2024
Characterizing the Training Dynamics of Private Fine-tuning with Langevin diffusion

Shuqi Ke, Charlie Hou, Sewoong Oh et al.

We show that differentially private full fine-tuning (DP-FFT) can distort pre-trained backbone features based on both theoretical and empirical results. We identify the cause of the distortion as the misalignment between the pre-trained backbone and the randomly initialized linear head. We prove that a sequential fine-tuning strategy can mitigate the feature distortion: first-linear-probing-then-fine-tuning (DP-LP-FFT). A new approximation scheme allows us to derive approximate upper and lower bounds on the training loss of DP-LP and DP-FFT, in a simple but canonical setting of 2-layer neural networks with ReLU activation. Experiments on real-world datasets and architectures are consistent with our theoretical insights. We also derive new upper bounds for 2-layer linear networks without the approximation. Moreover, our theory suggests a trade-off of privacy budget allocation in multi-phase fine-tuning methods like DP-LP-FFT.

LGAug 16, 2021
FedChain: Chained Algorithms for Near-Optimal Communication Cost in Federated Learning

Charlie Hou, Kiran K. Thekumparampil, Giulia Fanti et al.

Federated learning (FL) aims to minimize the communication complexity of training a model over heterogeneous data distributed across many clients. A common approach is local methods, where clients take multiple optimization steps over local data before communicating with the server (e.g., FedAvg). Local methods can exploit similarity between clients' data. However, in existing analyses, this comes at the cost of slow convergence in terms of the dependence on the number of communication rounds R. On the other hand, global methods, where clients simply return a gradient vector in each round (e.g., SGD), converge faster in terms of R but fail to exploit the similarity between clients even when clients are homogeneous. We propose FedChain, an algorithmic framework that combines the strengths of local methods and global methods to achieve fast convergence in terms of R while leveraging the similarity between clients. Using FedChain, we instantiate algorithms that improve upon previously known rates in the general convex and PL settings, and are near-optimal (via an algorithm-independent lower bound that we show) for problems that satisfy strong convexity. Empirical results support this theoretical gain over existing methods.

LGFeb 12, 2021
Efficient Algorithms for Federated Saddle Point Optimization

Charlie Hou, Kiran K. Thekumparampil, Giulia Fanti et al.

We consider strongly convex-concave minimax problems in the federated setting, where the communication constraint is the main bottleneck. When clients are arbitrarily heterogeneous, a simple Minibatch Mirror-prox achieves the best performance. As the clients become more homogeneous, using multiple local gradient updates at the clients significantly improves upon Minibatch Mirror-prox by communicating less frequently. Our goal is to design an algorithm that can harness the benefit of similarity in the clients while recovering the Minibatch Mirror-prox performance under arbitrary heterogeneity (up to log factors). We give the first federated minimax optimization algorithm that achieves this goal. The main idea is to combine (i) SCAFFOLD (an algorithm that performs variance reduction across clients for convex optimization) to erase the worst-case dependency on heterogeneity and (ii) Catalyst (a framework for acceleration based on modifying the objective) to accelerate convergence without amplifying client drift. We prove that this algorithm achieves our goal, and include experiments to validate the theory.

CRDec 4, 2019
SquirRL: Automating Attack Analysis on Blockchain Incentive Mechanisms with Deep Reinforcement Learning

Charlie Hou, Mingxun Zhou, Yan Ji et al.

Incentive mechanisms are central to the functionality of permissionless blockchains: they incentivize participants to run and secure the underlying consensus protocol. Designing incentive-compatible incentive mechanisms is notoriously challenging, however. As a result, most public blockchains today use incentive mechanisms whose security properties are poorly understood and largely untested. In this work, we propose SquirRL, a framework for using deep reinforcement learning to analyze attacks on blockchain incentive mechanisms. We demonstrate SquirRL's power by first recovering known attacks: (1) the optimal selfish mining attack in Bitcoin [52], and (2) the Nash equilibrium in block withholding attacks [16]. We also use SquirRL to obtain several novel empirical results. First, we discover a counterintuitive flaw in the widely used rushing adversary model when applied to multi-agent Markov games with incomplete information. Second, we demonstrate that the optimal selfish mining strategy identified in [52] is actually not a Nash equilibrium in the multi-agent selfish mining setting. In fact, our results suggest (but do not prove) that when more than two competing agents engage in selfish mining, there is no profitable Nash equilibrium. This is consistent with the lack of observed selfish mining in the wild. Third, we find a novel attack on a simplified version of Ethereum's finalization mechanism, Casper the Friendly Finality Gadget (FFG) that allows a strategic agent to amplify her rewards by up to 30%. Notably, [10] show that honest voting is a Nash equilibrium in Casper FFG: our attack shows that when Casper FFG is composed with selfish mining, this is no longer the case. Altogether, our experiments demonstrate SquirRL's flexibility and promise as a framework for studying attack settings that have thus far eluded theoretical and empirical understanding.