CL · Jun 9, 2022
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao et al. · allen-ai, amazon-science
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
HC · Jul 16, 2021
Dark Patterns in Online Shopping: Of Sneaky Tricks, Perceived Annoyance and Respective Brand Trust
Christian Voigt, Stephan Schlögl, Aleksander Groth
Dark patterns utilize interface elements to trick users into performing unwanted actions. Online shopping websites often employ these manipulative mechanisms so as to increase their potential customer base, to boost their sales, or to optimize their advertising efforts. Although dark patterns are often successful, they clearly inhibit positive user experiences. Particularly, with respect to customers' perceived annoyance and trust put into a given brand, they may have negative effects. To investigate respective connections between the use of dark patterns, users' perceived level of annoyance and their expressed brand trust, we conducted an experiment-based survey. We implemented two versions of a fictitious online shop; i.e. one which used five different types of dark patterns and a similar one without such manipulative user interface elements. A total of $n=204$ participants were then forwarded to one of the two shops (approx. $2/3$ to the shop which used the dark patterns) and asked to buy a specific product. Subsequently, we measured participants' perceived annoyance level, their expressed brand trust and their affinity for technology. Results show a higher level of perceived annoyance with those who used the dark pattern version of the online shop. Also, we found a significant connection between perceived annoyance and participants' expressed brand trust. A connection between participants' affinity for technology and their ability to recognize and consequently counter dark patterns, however, is not supported by our data.
CL · Mar 24, 2021
Thinking Aloud: Dynamic Context Generation Improves Zero-Shot Reasoning Performance of GPT-2
Gregor Betz, Kyle Richardson, Christian Voigt
Thinking aloud is an effective meta-cognitive strategy human reasoners apply to solve difficult problems. We propose improving the reasoning ability of pre-trained neural language models in a similar way, namely by expanding a task's context with problem elaborations that are dynamically generated by the language model itself. Our main result is that dynamic problem elaboration significantly improves the zero-shot performance of GPT-2 in a deductive reasoning and natural language inference task: While the model uses a syntactic heuristic for predicting an answer, it is capable (to some degree) of generating reasoned additional context which facilitates the successful application of its heuristic. We explore different ways of generating elaborations, including few-shot learning, and find that their relative performance varies with the specific problem characteristics (such as problem difficulty). Moreover, the effectiveness of an elaboration can be explained in terms of the degree to which the elaboration semantically coheres with the corresponding problem. In particular, elaborations that are most faithful to the original problem description may boost accuracy by up to 24%.
CL · Sep 15, 2020
Critical Thinking for Language Models
Gregor Betz, Christian Voigt, Kyle Richardson
This paper takes a first step towards a critical thinking curriculum for neural auto-regressive language models. We introduce a synthetic corpus of deductively valid arguments, and generate artificial argumentative texts to train and evaluate GPT-2. Significant transfer learning effects can be observed: Training a model on three simple core schemes allows it to accurately complete conclusions of different and more complex types of arguments as well. The language models generalize the core argument schemes in a correct way. Moreover, we obtain consistent and promising results for NLU benchmarks. In particular, pre-training on the argument schemes raises zero-shot accuracy on the GLUE diagnostics by up to 15 percentage points. The findings suggest that intermediary pre-training on texts that exemplify basic reasoning abilities (such as typically covered in critical thinking textbooks) might help language models to acquire a broad range of reasoning skills. The synthetic argumentative texts presented in this paper are a promising starting point for building such a "critical thinking curriculum for language models."