Peter J. Shaw

h-index54

6papers

1,215citations

Novelty47%

AI Score33

Ranked #116,274 of 194,257 authors (top 60%)#21,390 in CL (top 69%)

6 Papers

26.0CLOct 7, 2022Code

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

Kenton Lee, Mandar Joshi, Iulia Turc et al. · deepmind

Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.

31.1CLSep 29, 2022

Generate-and-Retrieve: use your predictions to improve retrieval for semantic parsing

Yury Zemlyanskiy, Michiel de Jong, Joshua Ainslie et al. · mit

A common recent approach to semantic parsing augments sequence-to-sequence models by retrieving and appending a set of training samples, called exemplars. The effectiveness of this recipe is limited by the ability to retrieve informative exemplars that help produce the correct parse, which is especially challenging in low-resource settings. Existing retrieval is commonly based on similarity of query and exemplar inputs. We propose GandR, a retrieval procedure that retrieves exemplars for which outputs are also similar. GandRfirst generates a preliminary prediction with input-based retrieval. Then, it retrieves exemplars with outputs similar to the preliminary prediction which are used to generate a final prediction. GandR sets the state of the art on multiple low-resource semantic parsing tasks.

39.5LGDec 14, 2023Code

Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal et al. · deepmind

Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed \emph{reward hacking}. A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate. We explore the application of reward ensembles to alignment at both training time (through reinforcement learning) and inference time (through reranking). First, we show that reward models are \emph{underspecified}: reward models that perform similarly in-distribution can yield very different rewards when used in alignment, due to distribution shift. Second, underspecification results in overoptimization, where alignment to one reward model does not improve reward as measured by another reward model trained on the same data. Third, overoptimization is mitigated by the use of reward ensembles, and ensembles that vary by their \emph{pretraining} seeds lead to better generalization than ensembles that differ only by their \emph{fine-tuning} seeds, with both outperforming individual reward models. However, even pretrain reward ensembles do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling because all reward models in the ensemble exhibit similar error patterns.

3.8CRDec 20, 2021

Vulnerability Analysis of the Android Kernel

Joseph R. Barr, Peter Shaw, Tyler Thatcher

We describe a workflow used to analyze the source code of the {\sc Android OS kernel} and rate for a particular kind of bugginess that exposes a program to hacking. The workflow represents a novel approach for components' vulnerability rating. The approach is inspired by recent work on embedding source code functions. The workflow combines deep learning with heuristics and machine learning. Deep learning is used to embed function/method labels into a Euclidean space. Because the corpus of Android kernel source code is rather limited (containing approximately 2 million C/C++ functions \& Java methods), a straightforward embedding is untenable. To overcome the challenge of the dearth of data, it's necessary to go through an intermediate step of the \textit{Byte-Pair Encoding}. Subsequently, we embed the tokens from which we assemble an embedding of function/method labels. Long short-term memory networks (LSTM) are used to embed tokens into vectors in $\mathbb{R}^d$ from which we form a \textit{cosine matrix} consisting of the cosine between every pair of vectors. The cosine matrix may be interpreted as a (combinatorial) `weighted' graph whose vertices represent functions/methods and `weighted' edges correspond to matrix entries. Features that include function vectors plus those defined heuristically are used to score for risk of bugginess.

1.6CLNov 9, 2021Code

Learning to Generalize Compositionally by Transferring Across Semantic Parsing Tasks

Wang Zhu, Peter Shaw, Tal Linzen et al.

Neural network models often generalize poorly to mismatched domains or distributions. In NLP, this issue arises in particular when models are expected to generalize compositionally, that is, to novel combinations of familiar words and constructions. We investigate learning representations that facilitate transfer learning from one compositional task to another: the representation and the task-specific layers of the models are strategically trained differently on a pre-finetuning task such that they generalize well on mismatched splits that require compositionality. We apply this method to semantic parsing, using three very different datasets, COGS, GeoQuery and SCAN, used alternately as the pre-finetuning and target task. Our method significantly improves compositional generalization over baselines on the test set of the target task, which is held out during fine-tuning. Ablation studies characterize the utility of the major steps in the proposed algorithm and support our hypothesis.

1.6LGOct 6, 2021

The Variability of Model Specification

Joseph R. Barr, Peter Shaw, Marcus Sobel

It's regarded as an axiom that a good model is one that compromises between bias and variance. The bias is measured in training cost, while the variance of a (say, regression) model is measure by the cost associated with a validation set. If reducing bias is the goal, one will strive to fetch as complex a model as necessary, but complexity is invariably coupled with variance: greater complexity implies greater variance. In practice, driving training cost to near zero does not pose a fundamental problem; in fact, a sufficiently complex decision tree is perfectly capable of driving training cost to zero; however, the problem is often with controlling the model's variance. We investigate various regression model frameworks, including generalized linear models, Cox proportional hazard models, ARMA, and illustrate how misspecifying a model affects the variance.