AINov 12, 2025
Consensus Sampling for Safer Generative AIAdam Tauman Kalai, Yael Tauman Kalai, Or Zamir
Many approaches to AI safety rely on inspecting model outputs or activations, yet certain risks are inherently undetectable by inspection alone. We propose a complementary, architecture-agnostic approach that enhances safety through the aggregation of multiple generative models, with the aggregated model inheriting its safety from the safest subset of a given size among them. Specifically, we present a consensus sampling algorithm that, given $k$ models and a prompt, achieves risk competitive with the average risk of the safest $s$ of the $k$ models, where $s$ is a chosen parameter, while abstaining when there is insufficient agreement between them. The approach leverages the models' ability to compute output probabilities, and we bound the probability of abstention when sufficiently many models are safe and exhibit adequate agreement. The algorithm is inspired by the provable copyright protection algorithm of Vyas et al. (2023). It requires some overlap among safe models, offers no protection when all models are unsafe, and may accumulate risk over repeated use. Nonetheless, our results provide a new, model-agnostic approach for AI safety by amplifying safety guarantees from an unknown subset of models within a collection to that of a single reliable model.
LGJul 10, 2020
Beyond Perturbations: Learning Guarantees with Arbitrary Adversarial Test ExamplesShafi Goldwasser, Adam Tauman Kalai, Yael Tauman Kalai et al.
We present a transductive learning algorithm that takes as input training examples from a distribution $P$ and arbitrary (unlabeled) test examples, possibly chosen by an adversary. This is unlike prior work that assumes that test examples are small perturbations of $P$. Our algorithm outputs a selective classifier, which abstains from predicting on some examples. By considering selective transductive learning, we give the first nontrivial guarantees for learning classes of bounded VC dimension with arbitrary train and test distributions---no prior guarantees were known even for simple classes of functions such as intervals on the line. In particular, for any function in a class $C$ of bounded VC dimension, we guarantee a low test error rate and a low rejection rate with respect to $P$. Our algorithm is efficient given an Empirical Risk Minimizer (ERM) for $C$. Our guarantees hold even for test examples chosen by an unbounded white-box adversary. We also give guarantees for generalization, agnostic, and unsupervised settings.
CRMar 5, 2015
Adaptively Secure Coin-Flipping, RevisitedShafi Goldwasser, Yael Tauman Kalai, Sunoo Park
The full-information model was introduced by Ben-Or and Linial in 1985 to study collective coin-flipping: the problem of generating a common bounded-bias bit in a network of $n$ players with $t=t(n)$ faults. They showed that the majority protocol can tolerate $t=O(\sqrt n)$ adaptive corruptions, and conjectured that this is optimal in the adaptive setting. Lichtenstein, Linial, and Saks proved that the conjecture holds for protocols in which each player sends a single bit. Their result has been the main progress on the conjecture in the last 30 years. In this work we revisit this question and ask: what about protocols involving longer messages? Can increased communication allow for a larger fraction of faulty players? We introduce a model of strong adaptive corruptions, where in each round, the adversary sees all messages sent by honest parties and, based on the message content, decides whether to corrupt a party (and intercept his message) or not. We prove that any one-round coin-flipping protocol, regardless of message length, is secure against at most $\tilde{O}(\sqrt n)$ strong adaptive corruptions. Thus, increased message length does not help in this setting. We then shed light on the connection between adaptive and strongly adaptive adversaries, by proving that for any symmetric one-round coin-flipping protocol secure against $t$ adaptive corruptions, there is a symmetric one-round coin-flipping protocol secure against $t$ strongly adaptive corruptions. Returning to the standard adaptive model, we can now prove that any symmetric one-round protocol with arbitrarily long messages can tolerate at most $\tilde{O}(\sqrt n)$ adaptive corruptions. At the heart of our results lies a novel use of the Minimax Theorem and a new technique for converting any one-round secure protocol into a protocol with messages of $polylog(n)$ bits. This technique may be of independent interest.
CRJan 1, 2014
The impossibility of obfuscation with auxiliary input or a universal simulatorNir Bitansky, Ran Canetti, Henry Cohn et al.
In this paper we show that the existence of general indistinguishability obfuscators conjectured in a few recent works implies, somewhat counterintuitively, strong impossibility results for virtual black box obfuscation. In particular, we show that indistinguishability obfuscation for all circuits implies: * The impossibility of average-case virtual black box obfuscation with auxiliary input for any circuit family with super-polynomial pseudo-entropy. Such circuit families include all pseudo-random function families, and all families of encryption algorithms and randomized digital signatures that generate their required coin flips pseudo-randomly. Impossibility holds even when the auxiliary input depends only on the public circuit family, and not the specific circuit in the family being obfuscated. * The impossibility of average-case virtual black box obfuscation with a universal simulator (with or without any auxiliary input) for any circuit family with super-polynomial pseudo-entropy. These bounds significantly strengthen the impossibility results of Goldwasser and Kalai (STOC 2005).