Adam Lopez

CL
h-index1
29papers
11,861citations
Novelty43%
AI Score34

29 Papers

LGOct 16, 2023Code
Taming the Sigmoid Bottleneck: Provably Argmaxable Sparse Multi-Label Classification

Andreas Grivas, Antonio Vergari, Adam Lopez

Sigmoid output layers are widely used in multi-label classification (MLC) tasks, in which multiple labels can be assigned to any input. In many practical MLC tasks, the number of possible labels is in the thousands, often exceeding the number of input features and resulting in a low-rank output layer. In multi-class classification, it is known that such a low-rank output layer is a bottleneck that can result in unargmaxable classes: classes which cannot be predicted for any input. In this paper, we show that for MLC tasks, the analogous sigmoid bottleneck results in exponentially many unargmaxable label combinations. We explain how to detect these unargmaxable outputs and demonstrate their presence in three widely used MLC datasets. We then show that they can be prevented in practice by introducing a Discrete Fourier Transform (DFT) output layer, which guarantees that all sparse label combinations with up to $k$ active labels are argmaxable. Our DFT layer trains faster and is more parameter efficient, matching the F1@k score of a sigmoid layer while using up to 50% fewer trainable parameters. Our code is publicly available at https://github.com/andreasgrv/sigmoid-bottleneck.

CLNov 8, 2023
First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models

Naomi Saphra, Eve Fleisig, Kyunghyun Cho et al. · cmu, harvard

Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large $n$-gram models for machine translation (MT). We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. We argue that disparities in scale are transient and researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many applications; that meaningful realistic evaluation is still an open problem; and that there is still room for speculative approaches.

LGMar 12, 2022
Low-Rank Softmax Can Have Unargmaxable Classes in Theory but Rarely in Practice

Andreas Grivas, Nikolay Bogoychev, Adam Lopez

Classifiers in natural language processing (NLP) often have a large number of output classes. For example, neural language models (LMs) and machine translation (MT) models both predict tokens from a vocabulary of thousands. The Softmax output layer of these models typically receives as input a dense feature representation, which has much lower dimensionality than the output. In theory, the result is some words may be impossible to be predicted via argmax, irrespective of input features, and empirically, there is evidence this happens in small language models. In this paper we ask whether it can happen in practical large language models and translation models. To do so, we develop algorithms to detect such \emph{unargmaxable} tokens in public models. We find that 13 out of 150 models do indeed have such tokens; however, they are very infrequent and unlikely to impact model quality. We release our code so that others can inspect their models.

CLNov 12, 2019Code
How to Evaluate Word Representations of Informal Domain?

Yekun Chai, Naomi Saphra, Adam Lopez

Diverse word representations have surged in most state-of-the-art natural language processing (NLP) applications. Nevertheless, how to efficiently evaluate such word embeddings in the informal domain such as Twitter or forums, remains an ongoing challenge due to the lack of sufficient evaluation dataset. We derived a large list of variant spelling pairs from UrbanDictionary with the automatic approaches of weakly-supervised pattern-based bootstrapping and self-training linear-chain conditional random field (CRF). With these extracted relation pairs we promote the odds of eliding the text normalization procedure of traditional NLP pipelines and directly adopting representations of non-standard words in the informal domain. Our code is available.

CLJan 10, 2025
LLMs Reproduce Stereotypes of Sexual and Gender Minorities

Ruby Ostrow, Adam Lopez

A large body of research has found substantial gender bias in NLP systems. Most of this research takes a binary, essentialist view of gender: limiting its variation to the categories _men_ and _women_, conflating gender with sex, and ignoring different sexual identities. But gender and sexuality exist on a spectrum, so in this paper we study the biases of large language models (LLMs) towards sexual and gender minorities beyond binary categories. Grounding our study in a widely used social psychology model -- the Stereotype Content Model -- we demonstrate that English-language survey questions about social perceptions elicit more negative stereotypes of sexual and gender minorities from both humans and LLMs. We then extend this framework to a more realistic use case: text generation. Our analysis shows that LLMs generate stereotyped representations of sexual and gender minorities in this setting, showing that they amplify representational harms in creative writing, a widely advertised use for LLMs.

CLMay 22, 2023
Cross-lingual Transfer Can Worsen Bias in Sentiment Analysis

Seraphina Goldfarb-Tarrant, Björn Ross, Adam Lopez

Sentiment analysis (SA) systems are widely deployed in many of the world's languages, and there is well-documented evidence of demographic bias in these systems. In languages beyond English, scarcer training data is often supplemented with transfer learning using pre-trained models, including multilingual models trained on other languages. In some cases, even supervision data comes from other languages. Does cross-lingual transfer also import new biases? To answer this question, we use counterfactual evaluation to test whether gender or racial biases are imported when using cross-lingual transfer, compared to a monolingual transfer setting. Across five languages, we find that systems using cross-lingual transfer usually become more biased than their monolingual counterparts. We also find racial biases to be much more prevalent than gender biases. To spur further research on this topic, we release the sentiment models we used for this study, and the intermediate checkpoints throughout training, yielding 1,525 distinct models; we also release our evaluation code.

CLMay 19, 2023
Bias Beyond English: Counterfactual Tests for Bias in Sentiment Analysis in Four Languages

Seraphina Goldfarb-Tarrant, Adam Lopez, Roi Blanco et al.

Sentiment analysis (SA) systems are used in many products and hundreds of languages. Gender and racial biases are well-studied in English SA systems, but understudied in other languages, with few resources for such studies. To remedy this, we build a counterfactual evaluation corpus for gender and racial/migrant bias in four languages. We demonstrate its usefulness by answering a simple but important question that an engineer might need to answer when deploying a system: What biases do systems import from pre-trained models when compared to a baseline with no pre-training? Our evaluation corpus, by virtue of being counterfactual, not only reveals which models have less bias, but also pinpoints changes in model bias behaviour, which enables more targeted mitigation strategies. We release our code and evaluation corpora to facilitate future research.

CLDec 31, 2020
Intrinsic Bias Metrics Do Not Correlate with Application Bias

Seraphina Goldfarb-Tarrant, Rebecca Marchant, Ricardo Muñoz Sanchez et al.

Natural Language Processing (NLP) systems learn harmful societal biases that cause them to amplify inequality as they are deployed in more and more situations. To guide efforts at debiasing these systems, the NLP community relies on a variety of metrics that quantify bias in models. Some of these metrics are intrinsic, measuring bias in word embedding spaces, and some are extrinsic, measuring bias in downstream tasks that the word embeddings enable. Do these intrinsic and extrinsic metrics correlate with each other? We compare intrinsic and extrinsic metrics across hundreds of trained models covering different tasks and experimental conditions. Our results show no reliable correlation between these metrics that holds in all scenarios across tasks and languages. We urge researchers working on debiasing to focus on extrinsic measures of bias, and to make using these measures more feasible via creation of new challenge sets and annotated test data. To aid this effort, we release code, a new intrinsic metric, and an annotated test set focused on gender bias in hate speech.

CLOct 21, 2020
LemMED: Fast and Effective Neural Morphological Analysis with Short Context Windows

Aibek Makazhanov, Sharon Goldwater, Adam Lopez

We present LemMED, a character-level encoder-decoder for contextual morphological analysis (combined lemmatization and tagging). LemMED extends and is named after two other attention-based models, namely Lematus, a contextual lemmatizer, and MED, a morphological (re)inflection model. Our approach does not require training separate lemmatization and tagging models, nor does it need additional resources and tools, such as morphological dictionaries or transducers. Moreover, LemMED relies solely on character-level representations and on local context. Although the model can, in principle, account for global context on sentence level, our experiments show that using just a single word of context around each target word is not only more computationally feasible, but yields better results as well. We evaluate LemMED in the framework of the SIMGMORPHON-2019 shared task on combined lemmatization and tagging. In terms of average performance LemMED ranks 5th among 13 systems and is bested only by the submissions that use contextualized embeddings.

CLOct 6, 2020
LSTMs Compose (and Learn) Bottom-Up

Naomi Saphra, Adam Lopez

Recent work in NLP shows that LSTM language models capture hierarchical structure in language data. In contrast to existing work, we consider the \textit{learning} process that leads to their compositional behavior. For a closer look at how an LSTM's sequential representations are composed hierarchically, we present a related measure of Decompositional Interdependence (DI) between word meanings in an LSTM, based on their gate interactions. We connect this measure to syntax with experiments on English language data, where DI is higher on pairs of words with lower syntactic distance. To explore the inductive biases that cause these compositional representations to arise during training, we conduct simple experiments on synthetic data. These synthetic experiments support a specific hypothesis about how hierarchical structures are discovered over the course of training: that LSTM constituent representations are learned bottom-up, relying on effective representations of their shorter children, rather than learning the longer-range relations independently from children.

CLMay 18, 2020
Inflecting when there's no majority: Limitations of encoder-decoder neural networks as cognitive models for German plurals

Kate McCurdy, Sharon Goldwater, Adam Lopez

Can artificial neural networks learn to represent inflectional morphology and generalize to new words as human speakers do? Kirov and Cotterell (2018) argue that the answer is yes: modern Encoder-Decoder (ED) architectures learn human-like behavior when inflecting English verbs, such as extending the regular past tense form -(e)d to novel words. However, their work does not address the criticism raised by Marcus et al. (1995): that neural models may learn to extend not the regular, but the most frequent class -- and thus fail on tasks like German number inflection, where infrequent suffixes like -s can still be productively generalized. To investigate this question, we first collect a new dataset from German speakers (production and ratings of plural forms for novel nouns) that is designed to avoid sources of information unavailable to the ED model. The speaker data show high variability, and two suffixes evince 'regular' behavior, appearing more often with phonologically atypical inputs. Encoder-decoder models do generalize the most frequently produced plural class, but do not show human-like variability or 'regular' extension of these other plural markers. We conclude that modern neural models may still struggle with minority-class generalization.

CLApr 27, 2020
Word Interdependence Exposes How LSTMs Compose Representations

Naomi Saphra, Adam Lopez

Recent work in NLP shows that LSTM language models capture compositional structure in language data. For a closer look at how these representations are composed hierarchically, we present a novel measure of interdependence between word meanings in an LSTM, based on their interactions at the internal gates. To explore how compositional representations arise over training, we conduct simple experiments on synthetic data, which illustrate our measure by showing how high interdependence can hurt generalization. These synthetic experiments also illustrate a specific hypothesis about how hierarchical structures are discovered over the course of training: that parent constituents rely on effective representations of their children, rather than on learning long-range relations independently. We further support this measure with experiments on English language data, where interdependence is higher for more closely syntactically linked word pairs.

CLSep 30, 2019
Semantic Graph Parsing with Recurrent Neural Network DAG Grammars

Federico Fancellu, Sorcha Gilroy, Adam Lopez et al.

Semantic parses are directed acyclic graphs (DAGs), so semantic parsing should be modeled as graph prediction. But predicting graphs presents difficult technical challenges, so it is simpler and more common to predict the linearized graphs found in semantic parsing datasets using well-understood sequence models. The cost of this simplicity is that the predicted strings may not be well-formed graphs. We present recurrent neural network DAG grammars, a graph-aware sequence model that ensures only well-formed graphs while sidestepping many difficulties in graph prediction. We test our model on the Parallel Meaning Bank---a multilingual semantic graphbank. Our approach yields competitive results in English and establishes the first results for German, Italian and Dutch.

CLSep 6, 2019
A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages

Clara Vania, Yova Kementchedjhieva, Anders Søgaard et al.

Parsers are available for only a handful of the world's languages, since they require lots of training data. How far can we get with just a small amount of training data? We systematically compare a set of simple strategies for improving low-resource parsers: data augmentation, which has not been tested before; cross-lingual training; and transliteration. Experimenting on three typologically diverse low-resource languages---North Sámi, Galician, and Kazah---We find that (1) when only the low-resource treebank is available, data augmentation is very helpful; (2) when a related high-resource treebank is available, cross-lingual training is helpful and complements data augmentation; and (3) when the high-resource treebank uses a different writing system, transliteration into a shared orthographic spaces is also very helpful.

CLAug 29, 2019
Cross-lingual topic prediction for speech using translations

Sameer Bansal, Herman Kamper, Adam Lopez et al.

Given a large amount of unannotated speech in a low-resource language, can we classify the speech utterances by topic? We consider this question in the setting where a small amount of speech in the low-resource language is paired with text translations in a high-resource language. We develop an effective cross-lingual topic classifier by training on just 20 hours of translated speech, using a recent model for direct speech-to-text translation. While the translations are poor, they are still good enough to correctly classify the topic of 1-minute speech segments over 70% of the time - a 20% improvement over a majority-class baseline. Such a system could be useful for humanitarian applications like crisis response, where incoming speech in a foreign low-resource language must be quickly assessed for further action.

CLJul 22, 2019
Sparsity Emerges Naturally in Neural Language Models

Naomi Saphra, Adam Lopez

Concerns about interpretability, computational resources, and principled inductive priors have motivated efforts to engineer sparse neural models for NLP tasks. If sparsity is important for NLP, might well-trained neural models naturally become roughly sparse? Using the Taxi-Euclidean norm to measure sparsity, we find that frequent input words are associated with concentrated or sparse activations, while frequent target words are associated with dispersed activations but concentrated gradients. We find that gradients associated with function words are more concentrated than the gradients of content words, even controlling for word frequency.

CLNov 1, 2018
Understanding Learning Dynamics Of Language Models with SVCCA

Naomi Saphra, Adam Lopez

Research has shown that neural models implicitly encode linguistic features, but there has been no research showing \emph{how} these encodings arise as the models are trained. We present the first study on the learning dynamics of neural language models, using a simple and flexible analysis method called Singular Vector Canonical Correlation Analysis (SVCCA), which enables us to compare learned representations across time and across models, without the need to evaluate directly on annotated data. We probe the evolution of syntactic, semantic, and topic representations and find that part-of-speech is learned earlier than topic; that recurrent layers become more similar to those of a tagger during training; and embedding layers less similar. Our results and methods could inform better learning algorithms for NLP models, possibly to incorporate linguistic information more effectively.

FLOct 29, 2018
The problem with probabilistic DAG automata for semantic graphs

Ieva Vasiljeva, Sorcha Gilroy, Adam Lopez

Semantic representations in the form of directed acyclic graphs (DAGs) have been introduced in recent years, and to model them, we need probabilistic models of DAGs. One model that has attracted some attention is the DAG automaton, but it has not been studied as a probabilistic model. We show that some DAG automata cannot be made into useful probabilistic models by the nearly universal strategy of assigning weights to transitions. The problem affects single-rooted, multi-rooted, and unbounded-degree variants of DAG automata, and appears to be pervasive. It does not affect planar variants, but these are problematic for other reasons.

CLOct 4, 2018
Neural Networks for Cross-lingual Negation Scope Detection

Federico Fancellu, Adam Lopez, Bonnie Webber

Negation scope has been annotated in several English and Chinese corpora, and highly accurate models for this task in these languages have been learned from these annotations. Unfortunately, annotations are not available in other languages. Could a model that detects negation scope be applied to a language that it hasn't been trained on? We develop neural models that learn from cross-lingual word embeddings or universal dependencies in English, and test them on Chinese, showing that they work surprisingly well. We find that modelling syntax is helpful even in monolingual settings and that cross-lingual word embeddings help relatively little, and we analyse cases that are still difficult for this task.

CLSep 5, 2018
Pre-training on high-resource speech recognition improves low-resource speech-to-text translation

Sameer Bansal, Herman Kamper, Karen Livescu et al.

We present a simple approach to improve direct speech-to-text translation (ST) when the source language is low-resource: we pre-train the model on a high-resource automatic speech recognition (ASR) task, and then fine-tune its parameters for ST. We demonstrate that our approach is effective by pre-training on 300 hours of English ASR data to improve Spanish-English ST from 10.8 to 20.2 BLEU when only 20 hours of Spanish-English ST training data are available. Through an ablation study, we find that the pre-trained encoder (acoustic model) accounts for most of the improvement, despite the fact that the shared language in these tasks is the target language text, not the source language audio. Applying this insight, we show that pre-training on ASR helps ST even when the ASR language differs from both source and target ST languages: pre-training on French ASR also improves Spanish-English ST. Finally, we show that the approach improves performance on a true low-resource task: pre-training on a combination of English ASR and French ASR improves Mboshi-French ST, where only 4 hours of data are available, from 3.5 to 7.1 BLEU.

CLAug 31, 2018
Indicatements that character language models learn English morpho-syntactic units and regularities

Yova Kementchedjhieva, Adam Lopez

Character language models have access to surface morphological patterns, but it is not clear whether or how they learn abstract morphological regularities. We instrument a character language model with several probes, finding that it can develop a specific unit to identify word boundaries and, by extension, morpheme boundaries, which allows it to capture linguistic properties and regularities of these units. Our language model proves surprisingly good at identifying the selectional restrictions of English derivational morphemes, a task that requires both morphological and syntactic awareness. Thus we conclude that, when morphemes overlap extensively with the words of a language, a character language model can perform morphological abstraction.

CLAug 28, 2018
What do character-level models learn about morphology? The case of dependency parsing

Clara Vania, Andreas Grivas, Adam Lopez

When parsing morphologically-rich languages with neural models, it is beneficial to model input at the character level, and it has been claimed that this is because character-level models learn morphology. We test these claims by comparing character-level models to an oracle with access to explicit morphological analysis on twelve languages with varying morphological typologies. Our results highlight many strengths of character-level models, but also show that they are poor at disambiguating some words, particularly in the face of case syncretism. We then demonstrate that explicitly modeling morphological case improves our best model, showing that character-level models can benefit from targeted forms of explicit morphological modeling.

CLMar 24, 2018
Low-Resource Speech-to-Text Translation

Sameer Bansal, Herman Kamper, Karen Livescu et al.

Speech-to-text translation has many potential applications for low-resource languages, but the typical approach of cascading speech recognition with machine translation is often impossible, since the transcripts needed to train a speech recognizer are usually not available for low-resource languages. Recent work has found that neural encoder-decoder models can learn to directly translate foreign speech in high-resource scenarios, without the need for intermediate transcription. We investigate whether this approach also works in settings where both data and computation are limited. To make the approach efficient, we make several architectural changes, including a change from character-level to word-level decoding. We find that this choice yields crucial speed improvements that allow us to train with fewer computational resources, yet still performs well on frequent words. We explore models trained on between 20 and 160 hours of data, and find that although models trained on less data have considerably lower BLEU scores, they can still predict words with relatively high precision and recall---around 50% for a model trained on 50 hours of data, versus around 60% for the full 160 hour model. Thus, they may still be useful for some low-resource scenarios.

CLAug 1, 2017
A Generative Parser with a Discriminative Recognition Algorithm

Jianpeng Cheng, Adam Lopez, Mirella Lapata

Generative models defining joint distributions over parse trees and sentences are useful for parsing and language modeling, but impose restrictions on the scope of features and are often outperformed by discriminative models. We propose a framework for parsing and language modeling which marries a generative model with a discriminative recognition model in an encoder-decoder setting. We provide interpretations of the framework based on expectation maximization and variational inference, and show that it enables parsing and language modeling within a single implementation. On the English Penn Treen-bank, our framework obtains competitive performance on constituency parsing while matching the state-of-the-art single-model language modeling score.

CLApr 26, 2017
From Characters to Words to in Between: Do We Capture Morphology?

Clara Vania, Adam Lopez

Words can be represented by composing the representations of subword units such as word segments, characters, and/or character n-grams. While such representations are effective and may capture the morphological regularities of words, they have not been systematically compared, and it is not understood how they interact with different morphological typologies. On a language modeling task, we present experiments that systematically vary (1) the basic unit of representation, (2) the composition of these representations, and (3) the morphological typology of the language modeled. Our results extend previous findings that character representations are effective across typologies, and we find that a previously unstudied combination of character trigram representations composed with bi-LSTMs outperforms most others. But we also find room for improvement: none of the character-level models match the predictive accuracy of a model with access to true morphological analyses, even when learned from an order of magnitude more data.

CLFeb 13, 2017
Towards speech-to-text translation without speech recognition

Sameer Bansal, Herman Kamper, Adam Lopez et al.

We explore the problem of translating speech to text in low-resource scenarios where neither automatic speech recognition (ASR) nor machine translation (MT) are available, but we have training data in the form of audio paired with text translations. We present the first system for this problem applied to a realistic multi-speaker dataset, the CALLHOME Spanish-English speech translation corpus. Our approach uses unsupervised term discovery (UTD) to cluster repeated patterns in the audio, creating a pseudotext, which we pair with translations to create a parallel text and train a simple bag-of-words MT model. We identify the challenges faced by the system, finding that the difficulty of cross-speaker UTD results in low recall, but that our system is still able to correctly translate some content words in test data.

CLFeb 10, 2017
Universal Dependencies to Logical Forms with Negation Scope

Federico Fancellu, Siva Reddy, Adam Lopez et al.

Many language technology applications would benefit from the ability to represent negation and its scope on top of widely-used linguistic resources. In this paper, we investigate the possibility of obtaining a first-order logic representation with negation scope marked using Universal Dependencies. To do so, we enhance UDepLambda, a framework that converts dependency graphs to logical forms. The resulting UDepLambda$\lnot$ is able to handle phenomena related to scope by means of an higher-order type theory, relevant not only to negation but also to universal quantification and other complex semantic phenomena. The initial conversion we did for English is promising, in that one can represent the scope of negation also in the presence of more complex phenomena such as universal quantifiers.

CLSep 21, 2016
Weakly supervised spoken term discovery using cross-lingual side information

Sameer Bansal, Herman Kamper, Sharon Goldwater et al.

Recent work on unsupervised term discovery (UTD) aims to identify and cluster repeated word-like units from audio alone. These systems are promising for some very low-resource languages where transcribed audio is unavailable, or where no written form of the language exists. However, in some cases it may still be feasible (e.g., through crowdsourcing) to obtain (possibly noisy) text translations of the audio. If so, this information could be used as a source of side information to improve UTD. Here, we present a simple method for rescoring the output of a UTD system using text translations, and test it on a corpus of Spanish audio with English translations. We show that it greatly improves the average precision of the results over a wide range of system configurations and data preprocessing methods.

CLJun 27, 2016
Evaluating Informal-Domain Word Representations With UrbanDictionary

Naomi Saphra, Adam Lopez

Existing corpora for intrinsic evaluation are not targeted towards tasks in informal domains such as Twitter or news comment forums. We want to test whether a representation of informal words fulfills the promise of eliding explicit text normalization as a preprocessing step. One possible evaluation metric for such domains is the proximity of spelling variants. We propose how such a metric might be computed and how a spelling variant dataset can be collected using UrbanDictionary.