Adrian Benton

CL
h-index38
12papers
3,623citations
Novelty41%
AI Score40

12 Papers

CLMay 23, 2022
What Makes Data-to-Text Generation Hard for Pretrained Language Models?

Moniba Keymanesh, Adrian Benton, Mark Dredze

Expressing natural language descriptions of structured facts or relations -- data-to-text generation (D2T) -- increases the accessibility of structured knowledge repositories. Previous work shows that pre-trained language models(PLMs) perform remarkably well on this task after fine-tuning on a significant amount of task-specific training data. On the other hand, while auto-regressive PLMs can generalize from a few task examples, their efficacy at D2T is largely unexplored. Furthermore, we have an incomplete understanding of the limits of PLMs on D2T. In this work, we conduct an empirical study of both fine-tuned and auto-regressive PLMs on the DART multi-domain D2T dataset. We consider their performance as a function of the amount of task-specific data and how these data are incorporated into the models: zero and few-shot learning, and fine-tuning of model weights. In addition, we probe the limits of PLMs by measuring performance on subsets of the evaluation data: novel predicates and abstractive test examples. To improve the performance on these subsets, we investigate two techniques: providing predicate descriptions in the context and re-ranking generated candidates by information reflected in the source. Finally, we conduct a human evaluation of model errors and show that D2T generation tasks would benefit from datasets with more careful manual curation.

CLJan 25, 2023
Weakly Supervised Headline Dependency Parsing

Adrian Benton, Tianze Shi, Ozan İrsoy et al.

English news headlines form a register with unique syntactic properties that have been documented in linguistics literature since the 1930s. However, headlines have received surprisingly little attention from the NLP syntactic parsing community. We aim to bridge this gap by providing the first news headline corpus of Universal Dependencies annotated syntactic dependency trees, which enables us to evaluate existing state-of-the-art dependency parsers on news headlines. To improve English news headline parsing accuracies, we develop a projection method to bootstrap silver training data from unlabeled news headline-article lead sentence pairs. Models trained on silver headline parses demonstrate significant improvements in performance over models trained solely on gold-annotated long-form texts. Ultimately, we find that, although projected silver training data improves parser performance across different news outlets, the improvement is moderated by constructions idiosyncratic to outlet.

CLApr 14, 2024
Exploring and Improving Drafts in Blockwise Parallel Decoding

Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni et al.

Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. as a method to improve inference speed of language models by simultaneously predicting multiple future tokens, termed block drafts, which are subsequently verified and conditionally accepted by the autoregressive model. This paper contributes to the understanding and improvement of block drafts in two ways. First, we analyze the token distributions produced by multiple prediction heads. Secondly, we leverage this analysis to develop algorithms to improve BPD inference speed by refining the block drafts using n-gram and neural language models. Experiments demonstrate that refined block drafts yield a +5-21% increase in block efficiency (i.e., the number of accepted tokens from the block draft) across diverse datasets.

LGNov 24, 2025
CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding

Ziteng Sun, Adrian Benton, Samuel Kushnir et al.

Post-training quantization is an effective method for reducing the serving cost of large language models, where the standard approach is to use a round-to-nearest quantization level scheme. However, this often introduces large errors due to outliers in the weights. Proposed mitigation mechanisms include applying adaptive rounding, random rotation transformations or committing to a post-training target using calibration data. Unfortunately, this reliance on calibration data can be severely limiting in some real-world scenarios as such data may be unavailable or subject to privacy regulations. In this paper, we propose algorithms to optimize transformations and adaptive rounding without access to any calibration data. The optimization is achieved by designing a suitable proxy function for the quantization loss without calibration data. To maintain inference efficiency, we perform structured matrix transformations for single matrices. For paired weights that interact directly in the computation graph, we use dual matrix transformations and adaptive rounding methods. We conduct experiments on Gemma 2 models, and observe consistent improvement over the baselines. For Gemma 2 9B quantization, our method improves the average benchmark score from 61.9 to 62.4 for 4-bit quantization and from 52.0 to 60.6 for 3-bit quantization, while adding less than 3% of computation overhead. Furthermore, our method achieves performance comparable to the commonly used GPTQ method, which requires calibration data.

CLApr 30, 2025
Improving Informally Romanized Language Identification

Adrian Benton, Alexander Gutkin, Christo Kirov et al.

The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), the lack of conventional spelling in the Latin script results in high spelling variability. Such romanization renders languages that are normally easily distinguished due to being written in different scripts - Hindi and Urdu, for example - highly confusable. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.

CLSep 15, 2021
Comparing Euclidean and Hyperbolic Embeddings on the WordNet Nouns Hypernymy Graph

Sameer Bansal, Adrian Benton

Nickel and Kiela (2017) present a new method for embedding tree nodes in the Poincare ball, and suggest that these hyperbolic embeddings are far more effective than Euclidean embeddings at embedding nodes in large, hierarchically structured graphs like the WordNet nouns hypernymy tree. This is especially true in low dimensions (Nickel and Kiela, 2017, Table 1). In this work, we seek to reproduce their experiments on embedding and reconstructing the WordNet nouns hypernymy graph. Counter to what they report, we find that Euclidean embeddings are able to represent this tree at least as well as Poincare embeddings, when allowed at least 50 dimensions. We note that this does not diminish the significance of their work given the impressive performance of hyperbolic embeddings in very low-dimensional settings. However, given the wide influence of their work, our aim here is to present an updated and more accurate comparison between the Euclidean and hyperbolic embeddings.

CLSep 15, 2021
Cross-Register Projection for Headline Part of Speech Tagging

Adrian Benton, Hanyang Li, Igor Malioutov

Part of speech (POS) tagging is a familiar NLP task. State of the art taggers routinely achieve token-level accuracies of over 97% on news body text, evidence that the problem is well understood. However, the register of English news headlines, "headlinese", is very different from the register of long-form text, causing POS tagging models to underperform on headlines. In this work, we automatically annotate news headlines with POS tags by projecting predicted tags from corresponding sentences in news bodies. We train a multi-domain POS tagger on both long-form and headline text and show that joint training on both registers improves over training on just one or naively concatenating training sets. We evaluate on a newly-annotated corpus of over 5,248 English news headlines from the Google sentence compression corpus, and show that our model yields a 23% relative error reduction per token and 19% per headline. In addition, we demonstrate that better headline POS tags can improve the performance of a syntax-based open information extraction system. We make POSH, the POS-tagged Headline corpus, available to encourage research in improved NLP models for news headlines.

CLApr 28, 2021
Diversity-Aware Batch Active Learning for Dependency Parsing

Tianze Shi, Adrian Benton, Igor Malioutov et al.

While the predictive performance of modern statistical dependency parsers relies heavily on the availability of expensive expert-annotated treebank data, not all annotations contribute equally to the training of the parsers. In this paper, we attempt to reduce the number of labeled examples needed to train a strong dependency parser using batch active learning (AL). In particular, we investigate whether enforcing diversity in the sampled batches, using determinantal point processes (DPPs), can improve over their diversity-agnostic counterparts. Simulation experiments on an English newswire corpus show that selecting diverse batches with DPPs is superior to strong selection strategies that do not enforce batch diversity, especially during the initial stages of the learning process. Additionally, our diversityaware strategy is robust under a corpus duplication setting, where diversity-agnostic sampling strategies exhibit significant degradation.

CLDec 30, 2020
Corrected CBOW Performs as well as Skip-gram

Ozan İrsoy, Adrian Benton, Karl Stratos

Mikolov et al. (2013a) observed that continuous bag-of-words (CBOW) word embeddings tend to underperform Skip-gram (SG) embeddings, and this finding has been reported in subsequent works. We find that these observations are driven not by fundamental differences in their training objectives, but more likely on faulty negative sampling CBOW implementations in popular libraries such as the official implementation, word2vec.c, and Gensim. We show that after correcting a bug in the CBOW gradient update, one can learn CBOW word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks, while being many times faster to train.

LGDec 2, 2018
Learning Representations of Social Media Users

Adrian Benton

User representations are routinely used in recommendation systems by platform developers, targeted advertisements by marketers, and by public policy researchers to gauge public opinion across demographic groups. Computer scientists consider the problem of inferring user representations more abstractly; how does one extract a stable user representation - effective for many downstream tasks - from a medium as noisy and complicated as social media? The quality of a user representation is ultimately task-dependent (e.g. does it improve classifier performance, make more accurate recommendations in a recommendation system) but there are proxies that are less sensitive to the specific task. Is the representation predictive of latent properties such as a person's demographic features, socioeconomic class, or mental health state? Is it predictive of the user's future behavior? In this thesis, we begin by showing how user representations can be learned from multiple types of user behavior on social media. We apply several extensions of generalized canonical correlation analysis to learn these representations and evaluate them at three tasks: predicting future hashtag mentions, friending behavior, and demographic features. We then show how user features can be employed as distant supervision to improve topic model fit. Finally, we show how user features can be integrated into and improve existing classifiers in the multitask learning framework. We treat user representations - ground truth gender and mental health features - as auxiliary tasks to improve mental health state prediction. We also use distributed user representations learned in the first chapter to improve tweet-level stance classifiers, showing that distant user information can inform classification tasks at the granularity of a single message.

CLDec 10, 2017
Multi-Task Learning for Mental Health using Social Media Text

Adrian Benton, Margaret Mitchell, Dirk Hovy

We introduce initial groundwork for estimating suicide risk and mental health in a deep learning framework. By modeling multiple conditions, the system learns to make predictions about suicide risk and mental health at a low false positive rate. Conditions are modeled as tasks in a multi-task learning (MTL) framework, with gender prediction as an additional auxiliary task. We demonstrate the effectiveness of multi-task learning by comparison to a well-tuned single-task baseline with the same number of parameters. Our best MTL model predicts potential suicide attempt, as well as the presence of atypical mental health, with AUC > 0.8. We also find additional large improvements using multi-task learning on mental health tasks with limited training data.

LGFeb 8, 2017
Deep Generalized Canonical Correlation Analysis

Adrian Benton, Huda Khayrallah, Biman Gujral et al.

We present Deep Generalized Canonical Correlation Analysis (DGCCA) -- a method for learning nonlinear transformations of arbitrarily many views of data, such that the resulting transformations are maximally informative of each other. While methods for nonlinear two-view representation learning (Deep CCA, (Andrew et al., 2013)) and linear many-view representation learning (Generalized CCA (Horst, 1961)) exist, DGCCA is the first CCA-style multiview representation learning technique that combines the flexibility of nonlinear (deep) representation learning with the statistical power of incorporating information from many independent sources, or views. We present the DGCCA formulation as well as an efficient stochastic optimization algorithm for solving it. We learn DGCCA representations on two distinct datasets for three downstream tasks: phonetic transcription from acoustic and articulatory measurements, and recommending hashtags and friends on a dataset of Twitter users. We find that DGCCA representations soundly beat existing methods at phonetic transcription and hashtag recommendation, and in general perform no worse than standard linear many-view techniques.