Chris Hays

LG
h-index44
7papers
78citations
Novelty47%
AI Score47

7 Papers

LGJan 17, 2023
Simplistic Collection and Labeling Practices Limit the Utility of Benchmark Datasets for Twitter Bot Detection

Chris Hays, Zachary Schutzman, Manish Raghavan et al. · mit

Accurate bot detection is necessary for the safety and integrity of online platforms. It is also crucial for research on the influence of bots in elections, the spread of misinformation, and financial market manipulation. Platforms deploy infrastructure to flag or remove automated accounts, but their tools and data are not publicly available. Thus, the public must rely on third-party bot detection. These tools employ machine learning and often achieve near perfect performance for classification on existing datasets, suggesting bot detection is accurate, reliable and fit for use in downstream applications. We provide evidence that this is not the case and show that high performance is attributable to limitations in dataset collection and labeling rather than sophistication of the tools. Specifically, we show that simple decision rules -- shallow decision trees trained on a small number of features -- achieve near-state-of-the-art performance on most available datasets and that bot detection datasets, even when combined together, do not generalize well to out-of-sample datasets. Our findings reveal that predictions are highly dependent on each dataset's collection and labeling procedures rather than fundamental differences between bots and humans. These results have important implications for both transparency in sampling and labeling procedures and potential biases in research using existing bot detection tools for pre-processing.

CYDec 30, 2025
Statistical Guarantees in the Search for Less Discriminatory Algorithms

Chris Hays, Ben Laufer, Solon Barocas et al.

Recent scholarship has argued that firms building data-driven decision systems in high-stakes domains like employment, credit, and housing should search for "less discriminatory algorithms" (LDAs) (Black et al., 2024). That is, for a given decision problem, firms considering deploying a model should make a good-faith effort to find equally performant models with lower disparate impact across social groups. Evidence from the literature on model multiplicity shows that randomness in training pipelines can lead to multiple models with the same performance, but meaningful variations in disparate impact. This suggests that developers can find LDAs simply by randomly retraining models. Firms cannot continue retraining forever, though, which raises the question: What constitutes a good-faith effort? In this paper, we formalize LDA search via model multiplicity as an optimal stopping problem, where a model developer with limited information wants to produce strong evidence that they have sufficiently explored the space of models. Our primary contribution is an adaptive stopping algorithm that yields a high-probability upper bound on the gains achievable from a continued search, allowing the developer to certify (e.g., to a court) that their search was sufficient. We provide a framework under which developers can impose stronger assumptions about the distribution of models, yielding correspondingly stronger bounds. We validate the method on real-world credit, employment and housing datasets.

79.1LGMar 27
Strategic Candidacy in Generative AI Arenas

Chris Hays, Rachel Li, Bailey Flanigan et al.

AI arenas, which rank generative models from pairwise preferences of users, are a popular method for measuring the relative performance of models in the course of their organic use. Because rankings are computed from noisy preferences, there is a concern that model producers can exploit this randomness by submitting many models (e.g., multiple variants of essentially the same model) and thereby artificially improve the rank of their top models. This can lead to degradations in the quality, and therefore the usefulness, of the ranking. In this paper, we begin by establishing, both theoretically and in simulations calibrated to data from the platform Arena (formerly LMArena, Chatbot Arena), conditions under which producers can benefit from submitting clones when their goal is to be ranked highly. We then propose a new mechanism for ranking models from pairwise comparisons, called You-Rank-We-Rank (YRWR). It requires that producers submit rankings over their own models and uses these rankings to correct statistical estimates of model quality. We prove that this mechanism is approximately clone-robust, in the sense that a producer cannot improve their rank much by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that model producers are able to correctly rank their own models, YRWR improves overall ranking accuracy. In further simulations, we show that indeed the mechanism is approximately clone-robust and quantify improvements to ranking accuracy, even under producer misranking.

LGNov 26, 2024
From Fairness to Infinity: Outcome-Indistinguishable (Omni)Prediction in Evolving Graphs

Cynthia Dwork, Chris Hays, Nicole Immorlica et al. · harvard

Professional networks provide invaluable entree to opportunity through referrals and introductions. A rich literature shows they also serve to entrench and even exacerbate a status quo of privilege and disadvantage. Hiring platforms, equipped with the ability to nudge link formation, provide a tantalizing opening for beneficial structural change. We anticipate that key to this prospect will be the ability to estimate the likelihood of edge formation in an evolving graph. Outcome-indistinguishable prediction algorithms ensure that the modeled world is indistinguishable from the real world by a family of statistical tests. Omnipredictors ensure that predictions can be post-processed to yield loss minimization competitive with respect to a benchmark class of predictors for many losses simultaneously, with appropriate post-processing. We begin by observing that, by combining a slightly modified form of the online K29 star algorithm of Vovk (2007) with basic facts from the theory of reproducing kernel Hilbert spaces, one can derive simple and efficient online algorithms satisfying outcome indistinguishability and omniprediction, with guarantees that improve upon, or are complementary to, those currently known. This is of independent interest. We apply these techniques to evolving graphs, obtaining online outcome-indistinguishable omnipredictors for rich -- possibly infinite -- sets of distinguishers that capture properties of pairs of nodes, and their neighborhoods. This yields, inter alia, multicalibrated predictions of edge formation with respect to pairs of demographic groups, and the ability to simultaneously optimize loss as measured by a variety of social welfare functions.

MLApr 10, 2025
Double Machine Learning for Causal Inference under Shared-State Interference

Chris Hays, Manish Raghavan

Researchers and practitioners often wish to measure treatment effects in settings where units interact via markets and recommendation systems. In these settings, units are affected by certain shared states, like prices, algorithmic recommendations or social signals. We formalize this structure, calling it shared-state interference, and argue that our formulation captures many relevant applied settings. Our key modeling assumption is that individuals' potential outcomes are independent conditional on the shared state. We then prove an extension of a double machine learning (DML) theorem providing conditions for achieving efficient inference under shared-state interference. We also instantiate our general theorem in several models of interest where it is possible to efficiently estimate the average direct effect (ADE) or global average treatment effect (GATE).

LGSep 28, 2025
The Impossibility of Inverse Permutation Learning in Transformer Models

Rohan Alur, Chris Hays, Manish Raghavan et al.

In this technical note, we study the problem of inverse permutation learning in decoder-only transformers. Given a permutation and a string to which that permutation has been applied, the model is tasked with producing the original (``canonical'') string. We argue that this task models a natural robustness property across a variety of reasoning tasks, including long-context retrieval, multiple choice QA and in-context learning. Our primary contribution is an impossibility result: we show that an arbitrary depth, decoder-only transformer cannot learn this task. This result concerns the expressive capacity of decoder-only transformer models and is agnostic to training dynamics or sample complexity. We give a pair of alternative constructions under which inverse permutation learning is feasible. The first of these highlights the fundamental role of the causal attention mask, and reveals a gap between the expressivity of encoder-decoder transformers and the more popular decoder-only architecture. The latter result is more surprising: we show that simply padding the input with ``scratch tokens" yields a construction under which inverse permutation learning is possible. We conjecture that this may suggest an alternative mechanism by which chain-of-thought prompting or, more generally, intermediate ``thinking'' tokens can enable reasoning in large language models, even when these tokens encode no meaningful semantic information (e.g., the results of intermediate computations).

CYOct 21, 2020
The Effect of the Rooney Rule on Implicit Bias in the Long Term

L. Elisa Celis, Chris Hays, Anay Mehrotra et al.

A robust body of evidence demonstrates the adverse effects of implicit bias in various contexts--from hiring to health care. The Rooney Rule is an intervention developed to counter implicit bias and has been implemented in the private and public sectors. The Rooney Rule requires that a selection panel include at least one candidate from an underrepresented group in their shortlist of candidates. Recently, Kleinberg and Raghavan proposed a model of implicit bias and studied the effectiveness of the Rooney Rule when applied to a single selection decision. However, selection decisions often occur repeatedly over time. Further, it has been observed that, given consistent counterstereotypical feedback, implicit biases against underrepresented candidates can change. We consider a model of how a selection panel's implicit bias changes over time given their hiring decisions either with or without the Rooney Rule in place. Our main result is that, when the panel is constrained by the Rooney Rule, their implicit bias roughly reduces at a rate that is the inverse of the size of the shortlist--independent of the number of candidates, whereas without the Rooney Rule, the rate is inversely proportional to the number of candidates. Thus, when the number of candidates is much larger than the size of the shortlist, the Rooney Rule enables a faster reduction in implicit bias, providing an additional reason in favor of using it as a strategy to mitigate implicit bias. Towards empirically evaluating the long-term effect of the Rooney Rule in repeated selection decisions, we conduct an iterative candidate selection experiment on Amazon MTurk. We observe that, indeed, decision-makers subject to the Rooney Rule select more minority candidates in addition to those required by the rule itself than they would if no rule is in effect, and do so without considerably decreasing the utility of candidates selected.