ITMay 31, 2022
Private Federated Submodel Learning with SparsificationSajani Vithana, Sennur Ulukus
We investigate the problem of private read update write (PRUW) in federated submodel learning (FSL) with sparsification. In FSL, a machine learning model is divided into multiple submodels, where each user updates only the submodel that is relevant to the user's local data. PRUW is the process of privately performing FSL by reading from and writing to the required submodel without revealing the submodel index, or the values of updates to the databases. Sparsification is a widely used concept in learning, where the users update only a small fraction of parameters to reduce the communication cost. Revealing the coordinates of these selected (sparse) updates leaks privacy of the user. We show how PRUW in FSL can be performed with sparsification. We propose a novel scheme which privately reads from and writes to arbitrary parameters of any given submodel, without revealing the submodel index, values of updates, or the coordinates of the sparse updates, to databases. The proposed scheme achieves significantly lower reading and writing costs compared to what is achieved without sparsification.
ITJul 3, 2024
Correlated Privacy Mechanisms for Differentially Private Distributed Mean EstimationSajani Vithana, Viveck R. Cadambe, Flavio P. Calmon et al.
Differentially private distributed mean estimation (DP-DME) is a fundamental building block in privacy-preserving federated learning, where a central server estimates the mean of $d$-dimensional vectors held by $n$ users while ensuring $(ε,δ)$-DP. Local differential privacy (LDP) and distributed DP with secure aggregation (SA) are the most common notions of DP used in DP-DME settings with an untrusted server. LDP provides strong resilience to dropouts, colluding users, and adversarial attacks, but suffers from poor utility. In contrast, SA-based DP-DME achieves an $O(n)$ utility gain over LDP in DME, but requires increased communication and computation overheads and complex multi-round protocols to handle dropouts and attacks. In this work, we present a generalized framework for DP-DME, that captures LDP and SA-based mechanisms as extreme cases. Our framework provides a foundation for developing and analyzing a variety of DP-DME protocols that leverage correlated privacy mechanisms across users. To this end, we propose CorDP-DME, a novel DP-DME mechanism based on the correlated Gaussian mechanism, that spans the gap between DME with LDP and distributed DP. We prove that CorDP-DME offers a favorable balance between utility and resilience to dropout and collusion. We provide an information-theoretic analysis of CorDP-DME, and derive theoretical guarantees for utility under any given privacy parameters and dropout/colluding user thresholds. Our results demonstrate that (anti) correlated Gaussian DP mechanisms can significantly improve utility in mean estimation tasks compared to LDP -- even in adversarial settings -- while maintaining better resilience to dropouts and attacks compared to distributed DP.
AIJul 11, 2024
Multi-Group Proportional Representation in RetrievalAlex Oesterling, Claudio Mayrink Verdun, Carol Xuan Long et al.
Image search and retrieval tasks can perpetuate harmful stereotypes, erase cultural identities, and amplify social disparities. Current approaches to mitigate these representational harms balance the number of retrieved items across population groups defined by a small number of (often binary) attributes. However, most existing methods overlook intersectional groups determined by combinations of group attributes, such as gender, race, and ethnicity. We introduce Multi-Group Proportional Representation (MPR), a novel metric that measures representation across intersectional groups. We develop practical methods for estimating MPR, provide theoretical guarantees, and propose optimization algorithms to ensure MPR in retrieval. We demonstrate that existing methods optimizing for equal and proportional representation metrics may fail to promote MPR. Crucially, our work shows that optimizing MPR yields more proportional representation across multiple intersectional groups specified by a rich function class, often with minimal compromise in retrieval accuracy.
LGFeb 6
ArcMark: Multi-bit LLM Watermark via Optimal TransportAtefeh Gilani, Carol Xuan Long, Sajani Vithana et al.
Watermarking is an important tool for promoting the responsible use of language models (LMs). Existing watermarks insert a signal into generated tokens that either flags LM-generated text (zero-bit watermarking) or encodes more complex messages (multi-bit watermarking). Though a number of recent multi-bit watermarks insert several bits into text without perturbing average next-token predictions, they largely extend design principles from the zero-bit setting, such as encoding a single bit per token. Notably, the information-theoretic capacity of multi-bit watermarking -- the maximum number of bits per token that can be inserted and detected without changing average next-token predictions -- has remained unknown. We address this gap by deriving the first capacity characterization of multi-bit watermarks. Our results inform the design of ArcMark: a new watermark construction based on coding-theoretic principles that, under certain assumptions, achieves the capacity of the multi-bit watermark channel. In practice, ArcMark outperforms competing multi-bit watermarks in terms of bit rate per token and detection accuracy. Our work demonstrates that LM watermarking is fundamentally a channel coding problem, paving the way for principled coding-theoretic approaches to watermark design.
CVMay 29, 2025
Multi-Group Proportional Representation for Text-to-Image ModelsSangwon Jung, Alex Oesterling, Claudio Mayrink Verdun et al.
Text-to-image (T2I) generative models can create vivid, realistic images from textual descriptions. As these models proliferate, they expose new concerns about their ability to represent diverse demographic groups, propagate stereotypes, and efface minority populations. Despite growing attention to the "safe" and "responsible" design of artificial intelligence (AI), there is no established methodology to systematically measure and control representational harms in image generation. This paper introduces a novel framework to measure the representation of intersectional groups in images generated by T2I models by applying the Multi-Group Proportional Representation (MPR) metric. MPR evaluates the worst-case deviation of representation statistics across given population groups in images produced by a generative model, allowing for flexible and context-specific measurements based on user requirements. We also develop an algorithm to optimize T2I models for this metric. Through experiments, we demonstrate that MPR can effectively measure representation statistics across multiple intersectional groups and, when used as a training objective, can guide models toward a more balanced generation across demographic groups while maintaining generation quality.
ITFeb 7, 2022
Private Read Update Write (PRUW) with Storage Constrained DatabasesSajani Vithana, Sennur Ulukus
We investigate the problem of private read update write (PRUW) in relation to federated submodel learning (FSL) with storage constrained databases. In PRUW, a user privately reads a submodel from a system of $N$ databases containing $M$ submodels, updates it locally, and writes the update back to the databases without revealing the submodel index or the value of the update. The databases considered in this problem are only allowed to store a given amount of information specified by an arbitrary storage constraint. We provide a storage mechanism that determines the contents of each database prior to the application of the PRUW scheme, such that the total communication cost is minimized. We show that the proposed storage scheme achieves a lower total cost compared to what is achieved by using \emph{coded storage} or \emph{divided storage} to meet the given storage constraint.
ITMar 30, 2020
Semantic Private Information RetrievalSajani Vithana, Karim Banawan, Sennur Ulukus
We investigate the problem of semantic private information retrieval (semantic PIR). In semantic PIR, a user retrieves a message out of $K$ independent messages stored in $N$ replicated and non-colluding databases without revealing the identity of the desired message to any individual database. The messages come with \emph{different semantics}, i.e., the messages are allowed to have \emph{non-uniform a priori probabilities} denoted by $(p_i>0,\: i \in [K])$, which are a proxy for their respective popularity of retrieval, and \emph{arbitrary message sizes} $(L_i,\: i \in [K])$. This is a generalization of the classical private information retrieval (PIR) problem, where messages are assumed to have equal a priori probabilities and equal message sizes. We derive the semantic PIR capacity for general $K$, $N$. The results show that the semantic PIR capacity depends on the number of databases $N$, the number of messages $K$, the a priori probability distribution of messages $p_i$, and the message sizes $L_i$. We present two achievable semantic PIR schemes: The first one is a deterministic scheme which is based on message asymmetry. This scheme employs non-uniform subpacketization. The second scheme is probabilistic and is based on choosing one query set out of multiple options at random to retrieve the required message without the need for exponential subpacketization. We derive necessary and sufficient conditions for the semantic PIR capacity to exceed the classical PIR capacity with equal priors and sizes. Our results show that the semantic PIR capacity can be larger than the classical PIR capacity when longer messages have higher popularities. However, when messages are equal-length, the non-uniform priors cannot be exploited to improve the retrieval rate over the classical PIR capacity.