LGApr 2, 2024
Asymptotics of Language Model AlignmentJoy Qiping Yang, Salman Salamatian, Ziteng Sun et al.
Let $p$ denote a generative language model. Let $r$ denote a reward model that returns a scalar that captures the degree at which a draw from $p$ is preferred. The goal of language model alignment is to alter $p$ to a new distribution $φ$ that results in a higher expected reward while keeping $φ$ close to $p.$ A popular alignment method is the KL-constrained reinforcement learning (RL), which chooses a distribution $φ_Δ$ that maximizes $E_{φ_Δ} r(y)$ subject to a relative entropy constraint $KL(φ_Δ|| p) \leq Δ.$ Another simple alignment method is best-of-$N$, where $N$ samples are drawn from $p$ and one with highest reward is selected. In this paper, we offer a closed-form characterization of the optimal KL-constrained RL solution. We demonstrate that any alignment method that achieves a comparable trade-off between KL divergence and reward must approximate the optimal KL-constrained RL solution in terms of relative entropy. To further analyze the properties of alignment methods, we introduce two simplifying assumptions: we let the language model be memoryless, and the reward model be linear. Although these assumptions may not reflect complex real-world scenarios, they enable a precise characterization of the asymptotic behavior of both the best-of-$N$ alignment, and the KL-constrained RL method, in terms of information-theoretic quantities. We prove that the reward of the optimal KL-constrained RL solution satisfies a large deviation principle, and we fully characterize its rate function. We also show that the rate of growth of the scaled cumulants of the reward is characterized by a proper Renyi cross entropy. Finally, we show that best-of-$N$ is asymptotically equivalent to KL-constrained RL solution by proving that their expected rewards are asymptotically equal, and concluding that the two distributions must be close in KL divergence.
ITSep 3, 2020
Network Coding-Based Post-Quantum CryptographyAlejandro Cohen, Rafael G. L. D'Oliveira, Salman Salamatian et al.
We propose a novel hybrid universal network-coding cryptosystem (HUNCC) to obtain secure post-quantum cryptography at high communication rates. The secure network-coding scheme we offer is hybrid in the sense that it combines information-theory security with public-key cryptography. In addition, the scheme is general and can be applied to any communication network, and to any public-key cryptosystem. Our hybrid scheme is based on the information theoretic notion of individual secrecy, which traditionally relies on the assumption that an eavesdropper can only observe a subset of the communication links between the trusted parties - an assumption that is often challenging to enforce. For this setting, several code constructions have been developed, where the messages are linearly mixed before transmission over each of the paths in a way that guarantees that an adversary which observes only a subset has sufficient uncertainty about each individual message. Instead, in this paper, we take a computational viewpoint, and construct a coding scheme in which an arbitrary secure cryptosystem is utilized on a subset of the links, while a pre-processing similar to the one in individual security is utilized. Under this scheme, we demonstrate 1) a computational security guarantee for an adversary which observes the entirety of the links 2) an information theoretic security guarantee for an adversary which observes a subset of the links, and 3) information rates which approach the capacity of the network and greatly improve upon the current solutions. A perhaps surprising consequence of our scheme is that, to guarantee a computational security level b, it is sufficient to encrypt a single link using a computational post-quantum scheme. In addition, the information rate approaches 1 as the number of communication links increases.
ITAug 21, 2020
Low Influence, Utility, and Independence in Differential Privacy: A Curious Case of $3 \choose 2$Rafael G. L. D'Oliveira, Salman Salamatian, Muriel Médard et al.
We study the relationship between randomized low influence functions and differentially private mechanisms. Our main aim is to formally determine whether differentially private mechanisms are low influence and whether low influence randomized functions can be differentially private. We show that differential privacy does not necessarily imply low influence in a formal sense. However, low influence implies approximate differential privacy. These results hold for both independent and non-independent randomized mechanisms, where an important instance of the former is the widely-used additive noise techniques in the differential privacy literature. Our study also reveals the interesting dynamics between utility, low influence, and independence of a differentially private mechanism. As the name of this paper suggests, we show that any two such features are simultaneously possible. However, in order to have a differentially private mechanism that has both utility and low influence, even under a very mild utility condition, one has to employ non-independent mechanisms.
MLFeb 21, 2019
Correspondence Analysis Using Neural NetworksHsiang Hsu, Salman Salamatian, Flavio P. Calmon
Correspondence analysis (CA) is a multivariate statistical tool used to visualize and interpret data dependencies. CA has found applications in fields ranging from epidemiology to social sciences. However, current methods used to perform CA do not scale to large, high-dimensional datasets. By re-interpreting the objective in CA using an information-theoretic tool called the principal inertia components, we demonstrate that performing CA is equivalent to solving a functional optimization problem over the space of finite variance functions of two random variable. We show that this optimization problem, in turn, can be efficiently approximated by neural networks. The resulting formulation, called the correspondence analysis neural network (CA-NN), enables CA to be performed at an unprecedented scale. We validate the CA-NN on synthetic data, and demonstrate how it can be used to perform CA on a variety of datasets, including food recipes, wine compositions, and images. Our results outperform traditional methods used in CA, indicating that CA-NN can serve as a new, scalable tool for interpretability and visualization of complex dependencies between random variables.
CYNov 29, 2018
Correspondence Analysis of Government Expenditure PatternsHsiang Hsu, Flavio P. Calmon, José Cândido Silveira Santos Filho et al.
We analyze expenditure patterns of discretionary funds by Brazilian congress members. This analysis is based on a large dataset containing over $7$ million expenses made publicly available by the Brazilian government. This dataset has, up to now, remained widely untouched by machine learning methods. Our main contributions are two-fold: (i) we provide a novel dataset benchmark for machine learning-based efforts for government transparency to the broader research community, and (ii) introduce a neural network-based approach for analyzing and visualizing outlying expense patterns. Our hope is that the approach presented here can inspire new machine learning methodologies for government transparency applicable to other developing nations.
LGJun 21, 2018
Generalizing Correspondence Analysis for Applications in Machine LearningHsiang Hsu, Salman Salamatian, Flavio P. Calmon
Correspondence analysis (CA) is a multivariate statistical tool used to visualize and interpret data dependencies by finding maximally correlated embeddings of pairs of random variables. CA has found applications in fields ranging from epidemiology to social sciences; however, current methods do not scale to large, high-dimensional datasets. In this paper, we provide a novel interpretation of CA in terms of an information-theoretic quantity called the principal inertia components. We show that estimating the principal inertia components, which consists in solving a functional optimization problem over the space of finite variance functions of two random variable, is equivalent to performing CA. We then leverage this insight to design novel algorithms to perform CA at an unprecedented scale. Particularly, we demonstrate how the principal inertia components can be reliably approximated from data using deep neural networks. Finally, we show how these maximally correlated embeddings of pairs of random variables in CA further play a central role in several learning problems including visualization of classification boundary and training process, and underlying recent multi-view and multi-modal learning methods.
ITMay 29, 2018
Why Botnets Work: Distributed Brute-Force Attacks Need No SynchronizationSalman Salamatian, Wasim Huleihel, Ahmad Beirami et al.
In September 2017, McAffee Labs quarterly report estimated that brute force attacks represent 20\% of total network attacks, making them the most prevalent type of attack ex-aequo with browser based vulnerabilities. These attacks have sometimes catastrophic consequences, and understanding their fundamental limits may play an important role in the risk assessment of password-secured systems, and in the design of better security protocols. While some solutions exist to prevent online brute-force attacks that arise from one single IP address, attacks performed by botnets are more challenging. In this paper, we analyze these distributed attacks by using a simplified model. Our aim is to understand the impact of distribution and asynchronization on the overall computational effort necessary to breach a system. Our result is based on Guesswork, a measure of the number of queries (guesses) required of an adversary before a correct sequence, such as a password, is found in an optimal attack. Guesswork is a direct surrogate for time and computational effort of guessing a sequence from a set of sequences with associated likelihoods. We model the lack of synchronization by a worst-case optimization in which the queries made by multiple adversarial agents are received in the worst possible order for the adversary, resulting in a min-max formulation. We show that, even without synchronization, and for sequences of growing length, the asymptotic optimal performance is achievable by using randomized guesses drawn from an appropriate distribution. Therefore, randomization is key for distributed asynchronous attacks. In other words, asynchronous guessers can asymptotically perform brute-force attacks as efficiently as synchronized guessers.
ITFeb 16, 2018
Generalizing Bottleneck ProblemsHsiang Hsu, Shahab Asoodeh, Salman Salamatian et al.
Given a pair of random variables $(X,Y)\sim P_{XY}$ and two convex functions $f_1$ and $f_2$, we introduce two bottleneck functionals as the lower and upper boundaries of the two-dimensional convex set that consists of the pairs $\left(I_{f_1}(W; X), I_{f_2}(W; Y)\right)$, where $I_f$ denotes $f$-information and $W$ varies over the set of all discrete random variables satisfying the Markov condition $W \to X \to Y$. Applying Witsenhausen and Wyner's approach, we provide an algorithm for computing boundaries of this set for $f_1$, $f_2$, and discrete $P_{XY}$. In the binary symmetric case, we fully characterize the set when (i) $f_1(t)=f_2(t)=t\log t$, (ii) $f_1(t)=f_2(t)=t^2-1$, and (iii) $f_1$ and $f_2$ are both $\ell^β$ norm function for $β\geq 2$. We then argue that upper and lower boundaries in (i) correspond to Mrs. Gerber's Lemma and its inverse (which we call Mr. Gerber's Lemma), in (ii) correspond to estimation-theoretic variants of Information Bottleneck and Privacy Funnel, and in (iii) correspond to Arimoto Information Bottleneck and Privacy Funnel.
CRAug 16, 2014
Managing your Private and Public Data: Bringing down Inference Attacks against your PrivacySalman Salamatian, Amy Zhang, Flavio du Pin Calmon et al.
We propose a practical methodology to protect a user's private data, when he wishes to publicly release data that is correlated with his private data, in the hope of getting some utility. Our approach relies on a general statistical inference framework that captures the privacy threat under inference attacks, given utility constraints. Under this framework, data is distorted before it is released, according to a privacy-preserving probabilistic mapping. This mapping is obtained by solving a convex optimization problem, which minimizes information leakage under a distortion constraint. We address practical challenges encountered when applying this theoretical framework to real world data. On one hand, the design of optimal privacy-preserving mechanisms requires knowledge of the prior distribution linking private data and data to be released, which is often unavailable in practice. On the other hand, the optimization may become untractable and face scalability issues when data assumes values in large size alphabets, or is high dimensional. Our work makes three major contributions. First, we provide bounds on the impact on the privacy-utility tradeoff of a mismatched prior. Second, we show how to reduce the optimization size by introducing a quantization step, and how to generate privacy mappings under quantization. Third, we evaluate our method on three datasets, including a new dataset that we collected, showing correlations between political convictions and TV viewing habits. We demonstrate that good privacy properties can be achieved with limited distortion so as not to undermine the original purpose of the publicly released data, e.g. recommendations.