75.0 HC Apr 20
Navigating the Conceptual Multiverse
Andre Ye, Jenny Y. Huang, Alicia Guo et al.
When language models answer open-ended problems, they implicitly make hidden decisions that shape their outputs, leaving users with uncontextualized answers rather than a working map of the problem. Drawing on multiverse analysis from statistics, we build and evaluate the conceptual multiverse, an interactive system that represents conceptual decisions, such as how to frame a question or what to value, as a space users can transparently inspect, intervene on and change, and check against principled domain reasoning. For this structure to be worth navigating rather than misleading, it must be rigorous and checkable against domain reasoning norms, so we develop a general verification framework that enforces properties of good decision structures, like unambiguity and completeness, calibrated by expert-level reasoning. Across three domains, the conceptual multiverse helped participants develop a working map of the problem: philosophy students rewrote essays with sharper framings and reversed theses, alignment annotators moved from surface preferences to reasoning about user intent and harm, and poets identified compositional patterns that clarified their taste.
HC Feb 7, 2022
Jury Learning: Integrating Dissenting Voices into Machine Learning Models
Mitchell L. Gordon, Michelle S. Lam, Joon Sung Park et al.
Whose labels should a machine learning (ML) algorithm learn to emulate? For ML tasks ranging from online comment toxicity to misinformation detection to medical diagnosis, different groups in society may have irreconcilable disagreements about ground truth labels. Supervised ML today resolves these label disagreements implicitly using majority vote, which overrides minority groups' labels. We introduce jury learning, a supervised ML approach that resolves these disagreements explicitly through the metaphor of a jury: defining which people or groups, in what proportion, determine the classifier's prediction. For example, a jury learning model for online toxicity might centrally feature women and Black jurors, who are commonly targets of online harassment. To enable jury learning, we contribute a deep learning architecture that models every annotator in a dataset, samples from annotators' models to populate the jury, then runs inference to classify. Our architecture enables juries that dynamically adapt their composition, explore counterfactuals, and visualize dissent.
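The jury pipeline described above (model each annotator, sample annotators to fill a jury of a chosen composition, then aggregate their predictions) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `predict` is a hypothetical stand-in for a per-annotator model, the names `sample_jury` and `jury_verdict` are invented, and aggregating by the median over jurors and over repeated trials is one plausible choice, not necessarily the authors' exact procedure.

```python
import random
from statistics import median


def sample_jury(annotators, composition, rng):
    """Draw a jury matching a requested group composition.

    annotators:  dict mapping annotator id -> group label (hypothetical schema)
    composition: dict mapping group label -> number of seats for that group
    """
    jury = []
    for group, seats in composition.items():
        pool = sorted(a for a, g in annotators.items() if g == group)
        if len(pool) < seats:
            raise ValueError(f"not enough annotators in group {group!r}")
        jury.extend(rng.sample(pool, seats))  # sample without replacement
    return jury


def jury_verdict(predict, item, annotators, composition, n_trials=100, seed=0):
    """Aggregate per-annotator predictions over repeatedly sampled juries.

    predict(annotator_id, item) stands in for a learned per-annotator model;
    here it is any callable returning a numeric label.
    """
    rng = random.Random(seed)
    verdicts = []
    for _ in range(n_trials):
        jury = sample_jury(annotators, composition, rng)
        votes = [predict(juror, item) for juror in jury]
        verdicts.append(median(votes))  # one verdict per sampled jury
    return median(verdicts)  # summary across all sampled juries
```

Changing `composition` and re-running `jury_verdict` on the same item is a simple way to see how the prediction shifts under a counterfactual jury.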
CY Apr 4, 2019
Goal-setting And Achievement In Activity Tracking Apps: A Case Study Of MyFitnessPal
Mitchell L. Gordon, Tim Althoff, Jure Leskovec
Activity tracking apps often make use of goals as one of their core motivational tools. There are two critical components to this tool: setting a goal, and subsequently achieving that goal. Despite its crucial role in how a number of prominent self-tracking apps function, there has been relatively little investigation of the goal-setting and achievement aspects of self-tracking apps. Here we explore this issue, investigating a particular goal setting and achievement process that is extensive, recorded, and crucial for both the app and its users' success: weight loss goals in MyFitnessPal. We present a large-scale study of 1.4 million users and weight loss goals, allowing for an unprecedentedly detailed view of how people set and achieve their goals. We find that, even for difficult long-term goals, behavior within the first 7 days predicts those who ultimately achieve their goals, that is, those who lose at least as much weight as they set out to, and those who do not. For instance, high amounts of early weight loss, which some researchers have classified as unsustainable, lead to higher goal achievement rates. We also show that early food intake, self-monitoring motivation, and attitude towards the goal are important factors. We then show that we can use our findings to predict goal achievement with a ROC AUC of 79% just 7 days after a goal is set. Finally, we discuss how our findings could inform steps to improve goal achievement in self-tracking apps.
CV Apr 1, 2019
HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models
Sharon Zhou, Mitchell L. Gordon, Ranjay Krishna et al.
Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy indirect proxies, because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE), a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model's outputs appear real (e.g. 250ms), and the other a less expensive variant that measures human error rate on fake and real images without time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track model improvements across training epochs, and we confirm via bootstrap sampling that HYPE rankings are consistent and replicable.
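As a rough illustration of checking ranking consistency with bootstrap sampling, the sketch below resamples per-model human judgments and counts how often the resampled ranking matches the original. It is a generic bootstrap under stated assumptions, not the paper's exact protocol: judgments are assumed to be booleans marking whether an evaluator misjudged an image, a higher human error rate is assumed to indicate a more realistic model, and the function names are hypothetical.

```python
import random


def error_rate(judgments):
    """Fraction of judgments where the evaluator was wrong (True = error)."""
    return sum(judgments) / len(judgments)


def rank_models(judgments_by_model):
    """Order models by descending error rate; ties broken by model name."""
    return sorted(
        judgments_by_model,
        key=lambda m: (-error_rate(judgments_by_model[m]), m),
    )


def bootstrap_rank_stability(judgments_by_model, n_boot=1000, seed=0):
    """Return the point-estimate ranking and the share of bootstrap
    resamples that reproduce that ranking exactly."""
    rng = random.Random(seed)
    base = rank_models(judgments_by_model)
    matches = 0
    for _ in range(n_boot):
        # Resample each model's judgments with replacement, keeping sample size.
        resampled = {
            m: rng.choices(j, k=len(j)) for m, j in judgments_by_model.items()
        }
        if rank_models(resampled) == base:
            matches += 1
    return base, matches / n_boot
```

A stability share close to 1.0 would suggest the ranking is robust to which judgments happened to be collected; models with similar error rates will swap places more often and pull the share down.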