Heike Adel

CL
h-index70
50papers
15,116citations
Novelty46%
AI Score37

50 Papers

CLApr 28, 2023
NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language Selection for Low-Resource Multilingual Sentiment Analysis

Mingyang Wang, Heike Adel, Lukas Lange et al.

This paper describes our system developed for the SemEval-2023 Task 12 "Sentiment Analysis for Low-resource African Languages using Twitter Dataset". Sentiment analysis is one of the most widely studied applications in natural language processing. However, most prior work still focuses on a small number of high-resource languages. Building reliable sentiment analysis systems for low-resource languages remains challenging, due to the limited training data in this task. In this work, we propose to leverage language-adaptive and task-adaptive pretraining on African texts and study transfer learning with source language selection on top of an African language-centric pretrained language model. Our key findings are: (1) Adapting the pretrained model to the target language and task using a small yet relevant corpus improves performance remarkably by more than 10 F1 score points. (2) Selecting source languages with positive transfer gains during training can avoid harmful interference from dissimilar languages, leading to better results in multilingual and cross-lingual settings. In the shared task, our system wins 8 out of 15 tracks and, in particular, performs best in the multilingual evaluation.

LGOct 23, 2023
GradSim: Gradient-Based Language Grouping for Effective Multilingual Training

Mingyang Wang, Heike Adel, Lukas Lange et al.

Most languages of the world pose low-resource challenges to natural language processing models. With multilingual training, knowledge can be shared among languages. However, not all languages positively influence each other and it is an open research question how to select the most suitable set of languages for multilingual training and avoid negative interference among languages whose characteristics or data distributions are not compatible. In this paper, we propose GradSim, a language grouping method based on gradient similarity. Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains compared to other similarity measures and it is better correlated with cross-lingual model performance. As a result, we set the new state of the art on AfriSenti, a benchmark dataset for sentiment analysis on low-resource African languages. In our extensive analysis, we further reveal that besides linguistic features, the topics of the datasets play an important role for language grouping and that lower layers of transformer models encode language-specific features while higher layers capture task-specific information.

CLMay 20, 2022
Multilingual Normalization of Temporal Expressions with Masked Language Models

Lukas Lange, Jannik Strötgen, Heike Adel et al.

The detection and normalization of temporal expressions is an important task and preprocessing step for many applications. However, prior work on normalization is rule-based, which severely limits the applicability in real-world multilingual settings, due to the costly creation of new rules. We propose a novel neural method for normalizing temporal expressions based on masked language modeling. Our multilingual method outperforms prior rule-based systems in many languages, and in particular, for low-resource languages with performance improvements of up to 33 F1 on average compared to the state of the art.

CLFeb 14, 2023
SwitchPrompt: Learning Domain-Specific Gated Soft Prompts for Classification in Low-Resource Domains

Koustava Goswami, Lukas Lange, Jun Araki et al.

Prompting pre-trained language models leads to promising results across natural language processing tasks but is less effective when applied in low-resource domains, due to the domain gap between the pre-training data and the downstream task. In this work, we bridge this gap with a novel and lightweight prompting methodology called SwitchPrompt for the adaptation of language models trained on datasets from the general domain to diverse low-resource domains. Using domain-specific keywords with a trainable gated prompt, SwitchPrompt offers domain-oriented prompting, that is, effective guidance on the target domains for general-domain language models. Our few-shot experiments on three text classification benchmarks demonstrate the efficacy of the general-domain pre-trained language models when used with SwitchPrompt. They often even outperform their domain-specific counterparts trained with baseline state-of-the-art prompting methods by up to 10.7% performance increase in accuracy. This result indicates that SwitchPrompt effectively reduces the need for domain-specific language model pre-training.

CLOct 13, 2022
Challenges in Explanation Quality Evaluation

Hendrik Schuff, Heike Adel, Peng Qi et al.

While much research focused on producing explanations, it is still unclear how the produced explanations' quality can be evaluated in a meaningful way. Today's predominant approach is to quantify explanations using proxy scores which compare explanations to (human-annotated) gold explanations. This approach assumes that explanations which reach higher proxy scores will also provide a greater benefit to human users. In this paper, we present problems of this approach. Concretely, we (i) formulate desired characteristics of explanation quality, (ii) describe how current evaluation practices violate them, and (iii) support our argumentation with initial evidence from a crowdsourcing case study in which we investigate the explanation quality of state-of-the-art explainable question answering systems. We find that proxy scores correlate poorly with human quality ratings and, additionally, become less expressive the more often they are used (i.e. following Goodhart's law). Finally, we propose guidelines to enable a meaningful evaluation of explanations to drive the development of systems that provide tangible benefits to human users.

CLSep 12, 2024
Learning Rules from KGs Guided by Language Models

Zihang Peng, Daria Stepanova, Vinh Thinh Ho et al.

Advances in information extraction have enabled the automatic construction of large knowledge graphs (e.g., Yago, Wikidata or Google KG), which are widely used in many applications like semantic search or data analytics. However, due to their semi-automatic construction, KGs are often incomplete. Rule learning methods, concerned with the extraction of frequent patterns from KGs and casting them into rules, can be applied to predict potentially missing facts. A crucial step in this process is rule ranking. Ranking of rules is especially challenging over highly incomplete or biased KGs (e.g., KGs predominantly storing facts about famous people), as in this case biased rules might fit the data best and be ranked at the top based on standard statistical metrics like rule confidence. To address this issue, prior works proposed to rank rules not only relying on the original KG but also facts predicted by a KG embedding model. At the same time, with the recent rise of Language Models (LMs), several works have claimed that LMs can be used as alternative means for KG completion. In this work, our goal is to verify to which extent the exploitation of LMs is helpful for improving the quality of rule learning systems.

LGMar 31, 2024
Rehearsal-Free Modular and Compositional Continual Learning for Language Models

Mingyang Wang, Heike Adel, Lukas Lange et al.

Continual learning aims at incrementally acquiring new knowledge while not forgetting existing knowledge. To overcome catastrophic forgetting, methods are either rehearsal-based, i.e., store data examples from previous tasks for data replay, or isolate parameters dedicated to each task. However, rehearsal-based methods raise privacy and memory issues, and parameter-isolation continual learning does not consider interaction between tasks, thus hindering knowledge transfer. In this work, we propose MoCL, a rehearsal-free Modular and Compositional Continual Learning framework which continually adds new modules to language models and composes them with existing modules. Experiments on various benchmarks show that MoCL outperforms state of the art and effectively facilitates knowledge transfer.

CLApr 5, 2025
Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models

Mingyang Wang, Heike Adel, Lukas Lange et al.

Multilingual language models (MLMs) store factual knowledge across languages but often struggle to provide consistent responses to semantically equivalent prompts in different languages. While previous studies point out this cross-lingual inconsistency issue, the underlying causes remain unexplored. In this work, we use mechanistic interpretability methods to investigate cross-lingual inconsistencies in MLMs. We find that MLMs encode knowledge in a language-independent concept space through most layers, and only transition to language-specific spaces in the final layers. Failures during the language transition often result in incorrect predictions in the target language, even when the answers are correct in other languages. To mitigate this inconsistency issue, we propose a linear shortcut method that bypasses computations in the final layers, enhancing both prediction accuracy and cross-lingual consistency. Our findings shed light on the internal mechanisms of MLMs and provide a lightweight, effective strategy for producing more consistent factual outputs.

CLApr 29, 2024
FREB-TQA: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering

Wei Zhou, Mohsen Mesgar, Heike Adel et al.

Table Question Answering (TQA) aims at composing an answer to a question based on tabular data. While prior research has shown that TQA models lack robustness, understanding the underlying cause and nature of this issue remains predominantly unclear, posing a significant obstacle to the development of robust TQA systems. In this paper, we formalize three major desiderata for a fine-grained evaluation of robustness of TQA systems. They should (i) answer questions regardless of alterations in table structure, (ii) base their responses on the content of relevant cells rather than on biases, and (iii) demonstrate robust numerical reasoning capabilities. To investigate these aspects, we create and publish a novel TQA evaluation benchmark in English. Our extensive experimental analysis reveals that none of the examined state-of-the-art TQA systems consistently excels in these three aspects. Our benchmark is a crucial instrument for monitoring the behavior of TQA systems and paves the way for the development of robust TQA systems. We release our benchmark publicly.

CLDec 28, 2024
Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering

Wei Zhou, Mohsen Mesgar, Annemarie Friedrich et al.

Complex table question answering (TQA) aims to answer questions that require complex reasoning, such as multi-step or multi-category reasoning, over data represented in tabular form. Previous approaches demonstrated notable performance by leveraging either closed-source large language models (LLMs) or fine-tuned open-weight LLMs. However, fine-tuning LLMs requires high-quality training data, which is costly to obtain, and utilizing closed-source LLMs poses accessibility challenges and leads to reproducibility issues. In this paper, we propose Multi-Agent Collaboration with Tool use (MACT), a framework that requires neither closed-source models nor fine-tuning. In MACT, a planning agent and a coding agent that also make use of tools collaborate to answer questions. Our experiments on four TQA benchmarks show that MACT outperforms previous SoTA systems on three out of four benchmarks and that it performs comparably to the larger and more expensive closed-source model GPT-4 on two benchmarks, even when using only open-weight models without any fine-tuning. We conduct extensive analyses to prove the effectiveness of MACT's multi-agent collaboration in TQA.

CLMay 20, 2025
Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes

Mingyang Wang, Lukas Lange, Heike Adel et al.

Reasoning language models (RLMs) excel at complex tasks by leveraging a chain-of-thought process to generate structured intermediate steps. However, language mixing, i.e., reasoning steps containing tokens from languages other than the prompt, has been observed in their outputs and shown to affect performance, though its impact remains debated. We present the first systematic study of language mixing in RLMs, examining its patterns, impact, and internal causes across 15 languages, 7 task difficulty levels, and 18 subject areas, and show how all three factors influence language mixing. Moreover, we demonstrate that the choice of reasoning language significantly affects performance: forcing models to reason in Latin or Han scripts via constrained decoding notably improves accuracy. Finally, we show that the script composition of reasoning traces closely aligns with that of the model's internal representations, indicating that language mixing reflects latent processing preferences in RLMs. Our findings provide actionable insights for optimizing multilingual reasoning and open new directions for controlling reasoning languages to build more interpretable and adaptable RLMs.

CLFeb 18, 2025
Bring Your Own Knowledge: A Survey of Methods for LLM Knowledge Expansion

Mingyang Wang, Alisa Stoll, Lukas Lange et al.

Adapting large language models (LLMs) to new and diverse knowledge is essential for their lasting effectiveness in real-world applications. This survey provides an overview of state-of-the-art methods for expanding the knowledge of LLMs, focusing on integrating various knowledge types, including factual information, domain expertise, language proficiency, and user preferences. We explore techniques, such as continual learning, model editing, and retrieval-based explicit adaptation, while discussing challenges like knowledge consistency and scalability. Designed as a guide for researchers and practitioners, this survey sheds light on opportunities for advancing LLMs as adaptable and robust knowledge systems.

CLMar 8, 2024
Explaining Pre-Trained Language Models with Attribution Scores: An Analysis in Low-Resource Settings

Wei Zhou, Heike Adel, Hendrik Schuff et al.

Attribution scores indicate the importance of different input parts and can, thus, explain model behaviour. Currently, prompt-based models are gaining popularity, i.a., due to their easier adaptability in low-resource settings. However, the quality of attribution scores extracted from prompt-based models has not been investigated yet. In this work, we address this topic by analyzing attribution scores extracted from prompt-based models w.r.t. plausibility and faithfulness and comparing them with attribution scores extracted from fine-tuned models and large language models. In contrast to previous work, we introduce training size as another dimension into the analysis. We find that using the prompting paradigm (with either encoder-based or decoder-based models) yields more plausible explanations than fine-tuning the models in low-resource settings and Shapley Value Sampling consistently outperforms attention and Integrated Gradients in terms of leading to more plausible and faithful explanations.

CLMay 20, 2025
Texts or Images? A Fine-grained Analysis on the Effectiveness of Input Representations and Models for Table Question Answering

Wei Zhou, Mohsen Mesgar, Heike Adel et al.

In table question answering (TQA), tables are encoded as either texts or images. Prior work suggests that passing images of tables to multi-modal large language models (MLLMs) performs comparably to or even better than using textual input with large language models (LLMs). However, the lack of controlled setups limits fine-grained distinctions between these approaches. In this paper, we conduct the first controlled study on the effectiveness of several combinations of table representations and models from two perspectives: question complexity and table size. We build a new benchmark based on existing TQA datasets. In a systematic analysis of seven pairs of MLLMs and LLMs, we find that the best combination of table representation and model varies across setups. We propose FRES, a method selecting table representations dynamically, and observe a 10% average performance improvement compared to using both representations indiscriminately.

CLMay 23, 2025
p2-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models

Wei Zhou, Mohsen Mesgar, Heike Adel et al.

Table question answering (TQA) focuses on answering questions based on tabular data. Developing TQA systems targets effective interaction with tabular data for tasks such as cell retrieval and data analysis. While recent work has leveraged fine-tuning to improve TQA systems, existing approaches often under-utilize available data and neglect the potential of post-training for further gains. In this work, we introduce p2-TQA, a process-based preference learning framework for TQA post-training. p2-TQA automatically constructs process-based preference data via a table-specific pipeline, eliminating the need for manual or costly data collection. It then optimizes models through contrastive learning on the collected data. Experiments show that p2-TQA effectively improves TQA models by up to 5% on in-domain datasets and 2.4% on out-of-domain datasets with only 8,000 training instances. Furthermore, models enhanced with p2-TQA achieve competitive results against larger, more complex state-of-the-art TQA systems, while maintaining up to five times higher efficiency.

CVDec 12, 2024
Human vs. AI: A Novel Benchmark and a Comparative Study on the Detection of Generated Images and the Impact of Prompts

Philipp Moeßner, Heike Adel

With the advent of publicly available AI-based text-to-image systems, the process of creating photorealistic but fully synthetic images has been largely democratized. This can pose a threat to the public through a simplified spread of disinformation. Machine detectors and human media expertise can help to differentiate between AI-generated (fake) and real images and counteract this danger. Although AI generation models are highly prompt-dependent, the impact of the prompt on the fake detection performance has rarely been investigated yet. This work therefore examines the influence of the prompt's level of detail on the detectability of fake images, both with an AI detector and in a user study. For this purpose, we create a novel dataset, COCOXGEN, which consists of real photos from the COCO dataset as well as images generated with SDXL and Fooocus using prompts of two standardized lengths. Our user study with 200 participants shows that images generated with longer, more detailed prompts are detected significantly more easily than those generated with short prompts. Similarly, an AI-based detection model achieves better performance on images generated with longer prompts. However, humans and AI models seem to pay attention to different details, as we show in a heat map analysis.

LGJun 26, 2024
Learn it or Leave it: Module Composition and Pruning for Continual Learning

Mingyang Wang, Heike Adel, Lukas Lange et al.

In real-world environments, continual learning is essential for machine learning models, as they need to acquire new knowledge incrementally without forgetting what they have already learned. While pretrained language models have shown impressive capabilities on various static tasks, applying them to continual learning poses significant challenges, including avoiding catastrophic forgetting, facilitating knowledge transfer, and maintaining parameter efficiency. In this paper, we introduce MoCL-P, a novel lightweight continual learning method that addresses these challenges simultaneously. Unlike traditional approaches that continuously expand parameters for newly arriving tasks, MoCL-P integrates task representation-guided module composition with adaptive pruning, effectively balancing knowledge integration and computational overhead. Our evaluation across three continual learning benchmarks with up to 176 tasks shows that MoCL-P achieves state-of-the-art performance and improves parameter efficiency by up to three times, demonstrating its potential for practical applications where resource requirements are constrained.

CLMay 4, 2023
Neighboring Words Affect Human Interpretation of Saliency Explanations

Alon Jacovi, Hendrik Schuff, Heike Adel et al.

Word-level saliency explanations ("heat maps over words") are often used to communicate feature-attribution in text-based models. Recent studies found that superficial factors such as word length can distort human interpretation of the communicated saliency scores. We conduct a user study to investigate how the marking of a word's neighboring words affect the explainee's perception of the word's importance in the context of a saliency explanation. We find that neighboring words have significant effects on the word's importance rating. Concretely, we identify that the influence changes based on neighboring direction (left vs. right) and a-priori linguistic and computational measures of phrases and collocations (vs. unrelated neighboring words). Our results question whether text-based saliency explanations should be continued to be communicated at word level, and inform future research on alternative saliency explanation methods.

CLJan 27, 2022
Human Interpretation of Saliency-based Explanation Over Text

Hendrik Schuff, Alon Jacovi, Heike Adel et al.

While a lot of research in explainable AI focuses on producing effective explanations, less work is devoted to the question of how people understand and interpret the explanation. In this work, we focus on this question through a study of saliency-based explanations over textual data. Feature-attribution explanations of text models aim to communicate which parts of the input text were more influential than others towards the model decision. Many current explanation methods, such as gradient-based or Shapley value-based methods, provide measures of importance which are well-understood mathematically. But how does a person receiving the explanation (the explainee) comprehend it? And does their understanding match what the explanation attempted to communicate? We empirically investigate the effect of various factors of the input, the feature-attribution explanation, and visualization procedure, on laypeople's interpretation of the explanation. We query crowdworkers for their interpretation on tasks in English and German, and fit a GAMM model to their responses considering the factors of interest. We find that people often mis-interpret the explanations: superficial and unrelated factors, such as word length, influence the explainees' importance assignment despite the explanation communicating importance directly. We then show that some of this distortion can be attenuated: we propose a method to adjust saliencies based on model estimates of over- and under-perception, and explore bar charts as an alternative to heatmap saliency visualization. We find that both approaches can attenuate the distorting effect of specific factors, leading to better-calibrated understanding of the explanation.

CLDec 16, 2021
CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain

Lukas Lange, Heike Adel, Jannik Strötgen et al.

The field of natural language processing (NLP) has recently seen a large change towards using pre-trained language models for solving almost any task. Despite showing great improvements in benchmark datasets for various tasks, these models often perform sub-optimal in non-standard domains like the clinical domain where a large gap between pre-training documents and target documents is observed. In this paper, we aim at closing this gap with domain-specific training of the language model and we investigate its effect on a diverse set of downstream tasks and settings. We introduce the pre-trained CLIN-X (Clinical XLM-R) language models and show how CLIN-X outperforms other pre-trained transformer models by a large margin for ten clinical concept extraction tasks from two languages. In addition, we demonstrate how the transformer model can be further improved with our proposed task- and language-agnostic model architecture based on ensembles over random splits and cross-sentence context. Our studies in low-resource and transfer settings reveal stable model performance despite a lack of annotated data with improvements of up to 47 F1 points when only 250 labeled sentences are available. Our results highlight the importance of specialized language models as CLIN-X for concept extraction in non-standard domains, but also show that our task-agnostic model architecture is robust across the tested tasks and languages so that domain- or task-specific adaptations are not required.

CLSep 17, 2021
Boosting Transformers for Job Expression Extraction and Classification in a Low-Resource Setting

Lukas Lange, Heike Adel, Jannik Strötgen

In this paper, we explore possible improvements of transformer models in a low-resource setting. In particular, we present our approaches to tackle the first two of three subtasks of the MEDDOPROF competition, i.e., the extraction and classification of job expressions in Spanish clinical texts. As neither language nor domain experts, we experiment with the multilingual XLM-R transformer model and tackle these low-resource information extraction tasks as sequence-labeling problems. We explore domain- and language-adaptive pretraining, transfer learning and strategic datasplits to boost the transformer model. Our results show strong improvements using these methods by up to 5.3 F1 points compared to a fine-tuned XLM-R model. Our best models achieve 83.2 and 79.3 F1 for the first two tasks, respectively.

CLSep 16, 2021
Does External Knowledge Help Explainable Natural Language Inference? Automatic Evaluation vs. Human Ratings

Hendrik Schuff, Hsiu-Yu Yang, Heike Adel et al.

Natural language inference (NLI) requires models to learn and apply commonsense knowledge. These reasoning abilities are particularly important for explainable NLI systems that generate a natural language explanation in addition to their label prediction. The integration of external knowledge has been shown to improve NLI systems, here we investigate whether it can also improve their explanation capabilities. For this, we investigate different sources of external knowledge and evaluate the performance of our models on in-domain data as well as on special transfer datasets that are designed to assess fine-grained reasoning capabilities. We find that different sources of knowledge have a different effect on reasoning abilities, for example, implicit knowledge stored in language models can hinder reasoning on numbers and negations. Finally, we conduct the largest and most fine-grained explainable NLI crowdsourcing study to date. It reveals that even large differences in automatic performance scores do neither reflect in human ratings of label, explanation, commonsense nor grammar correctness.

LGJul 26, 2021
Thought Flow Nets: From Single Predictions to Trains of Model Thought

Hendrik Schuff, Heike Adel, Ngoc Thang Vu

When humans solve complex problems, they typically create a sequence of ideas (involving an intuitive decision, reflection, error correction, etc.) in order to reach a conclusive decision. Contrary to this, today's models are mostly trained to map an input to one single and fixed output. In this paper, we investigate how we can give models the opportunity of a second, third and $k$-th thought. Taking inspiration from Hegel's dialectics, we propose the concept of a thought flow which creates a sequence of predictions. We present a self-correction mechanism that is trained to estimate the model's correctness and performs iterative prediction updates based on the correctness prediction's gradient. We introduce our method at the example of question answering and conduct extensive experiments that demonstrate (i) our method's ability to correct its own predictions and (ii) its potential to notably improve model performances. In addition, we conduct a qualitative analysis of thought flow correction patterns and explore how thought flow predictions affect human users within a crowdsourcing study. We find that (iii) thought flows enable improved user performance and are perceived as more natural, correct, and intelligent as single and/or top-3 predictions.

CLApr 22, 2021
Enriched Attention for Robust Relation Extraction

Heike Adel, Jannik Strötgen

The performance of relation extraction models has increased considerably with the rise of neural networks. However, a key issue of neural relation extraction is robustness: the models do not scale well to long sentences with multiple entities and relations. In this work, we address this problem with an enriched attention mechanism. Attention allows the model to focus on parts of the input sentence that are relevant to relation extraction. We propose to enrich the attention function with features modeling knowledge about the relation arguments and the shortest dependency path between them. Thus, for different relation arguments, the model can pay attention to different parts of the sentence. Our model outperforms prior work using comparable setups on two popular benchmarks, and our analysis confirms that it indeed scales to long sentences with many entities.

CLApr 16, 2021
To Share or not to Share: Predicting Sets of Sources for Model Transfer Learning

Lukas Lange, Jannik Strötgen, Heike Adel et al.

In low-resource settings, model transfer can help to overcome a lack of labeled data for many tasks and domains. However, predicting useful transfer sources is a challenging problem, as even the most similar sources might lead to unexpected negative transfer results. Thus, ranking methods based on task and text similarity -- as suggested in prior work -- may not be sufficient to identify promising sources. To tackle this problem, we propose a new approach to automatically determine which and how many sources should be exploited. For this, we study the effects of model transfer on sequence labeling across various domains and tasks and show that our methods based on model similarity and support vector machines are able to predict promising sources, resulting in performance increases of up to 24 F1 points.

CLOct 23, 2020
NLNDE at CANTEMIST: Neural Sequence Labeling and Parsing Approaches for Clinical Concept Extraction

Lukas Lange, Xiang Dai, Heike Adel et al.

The recognition and normalization of clinical information, such as tumor morphology mentions, is an important, but complex process consisting of multiple subtasks. In this paper, we describe our system for the CANTEMIST shared task, which is able to extract, normalize and rank ICD codes from Spanish electronic health records using neural sequence labeling and parsing approaches with context-aware embeddings. Our best system achieves 85.3 F1, 76.7 F1, and 77.0 MAP for the three tasks, respectively.

CLOct 23, 2020
A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios

Michael A. Hedderich, Lukas Lange, Heike Adel et al.

Deep neural networks and huge language models are becoming omnipresent in natural language applications. As they are known for requiring large amounts of training data, there is a growing body of work to improve the performance in low-resource settings. Motivated by the recent fundamental changes towards neural models and the popular pre-train and fine-tune paradigm, we survey promising approaches for low-resource natural language processing. After a discussion about the different dimensions of data availability, we give a structured overview of methods that enable learning when training data is sparse. This includes mechanisms to create additional labeled data like data augmentation and distant supervision as well as transfer learning settings that reduce the need for target supervision. A goal of our survey is to explain how these methods differ in their requirements as understanding them is essential for choosing a technique suited for a specific low-resource setting. Further key aspects of this work are to highlight open issues and to outline promising directions for future research.

CLOct 23, 2020
FAME: Feature-Based Adversarial Meta-Embeddings for Robust Input Representations

Lukas Lange, Heike Adel, Jannik Strötgen et al.

Combining several embeddings typically improves performance in downstream tasks as different embeddings encode different information. It has been shown that even models using embeddings from transformers still benefit from the inclusion of standard word embeddings. However, the combination of embeddings of different types and dimensions is challenging. As an alternative to attention-based meta-embeddings, we propose feature-based adversarial meta-embeddings (FAME) with an attention function that is guided by features reflecting word-specific properties, such as shape and frequency, and show that this is beneficial to handle subword-based embeddings. In addition, FAME uses adversarial training to optimize the mappings of differently-sized embeddings to the same space. We demonstrate that FAME works effectively across languages and domains for sequence labeling and sentence classification, in particular in low-resource settings. FAME sets the new state of the art for POS tagging in 27 languages, various NER settings and question classification in different domains.

CLOct 22, 2020
An Analysis of Simple Data Augmentation for Named Entity Recognition

Xiang Dai, Heike Adel

Simple yet effective data augmentation techniques have been proposed for sentence-level and sentence-pair natural language processing tasks. Inspired by these efforts, we design and compare data augmentation for named entity recognition, which is usually modeled as a token-level sequence labeling problem. Through experiments on two data sets from the biomedical and materials science domains (i2b2-2010 and MaSciP), we show that simple augmentation can boost performance for both recurrent and transformer-based models, especially for small training sets.

CLOct 13, 2020
F1 is Not Enough! Models and Evaluation Towards User-Centered Explainable Question Answering

Hendrik Schuff, Heike Adel, Ngoc Thang Vu

Explainable question answering systems predict an answer together with an explanation showing why the answer has been selected. The goal is to enable users to assess the correctness of the system and understand its reasoning process. However, we show that current models and evaluation settings have shortcomings regarding the coupling of answer and explanation which might cause serious issues in user experience. As a remedy, we propose a hierarchical model and a new regularization term to strengthen the answer-explanation coupling as well as two evaluation scores to quantify the coupling. We conduct experiments on the HOTPOTQA benchmark data set and perform a user study. The user study shows that our models increase the ability of the users to judge the correctness of the system and that scores like F1 are not enough to estimate the usefulness of a model in a practical setting with human users. Our scores are better aligned with user experience, making them promising candidates for model selection.

CLJul 2, 2020
NLNDE: The Neither-Language-Nor-Domain-Experts' Way of Spanish Medical Document De-Identification

Lukas Lange, Heike Adel, Jannik Strötgen

Natural language processing has huge potential in the medical domain which recently led to a lot of research in this field. However, a prerequisite of secure processing of medical documents, e.g., patient notes and clinical trials, is the proper de-identification of privacy-sensitive information. In this paper, we describe our NLNDE system, with which we participated in the MEDDOCAN competition, the medical document anonymization task of IberLEF 2019. We address the task of detecting and classifying protected health information from Spanish data as a sequence-labeling problem and investigate different embedding methods for our neural network. Despite dealing in a non-standard language and domain setting, the NLNDE system achieves promising results in the competition.

CLJul 2, 2020
NLNDE: Enhancing Neural Sequence Taggers with Attention and Noisy Channel for Robust Pharmacological Entity Detection

Lukas Lange, Heike Adel, Jannik Strötgen

Named entity recognition has been extensively studied on English news texts. However, the transfer to other domains and languages is still a challenging problem. In this paper, we describe the system with which we participated in the first subtrack of the PharmaCoNER competition of the BioNLP Open Shared Tasks 2019. Aiming at pharmacological entity detection in Spanish texts, the task provides a non-standard domain and language setting. However, we propose an architecture that requires neither language nor domain expertise. We treat the task as a sequence labeling task and experiment with attention-based embedding selection and the training on automatically annotated data to further improve our system's performance. Our system achieves promising results, especially by combining the different techniques, and reaches up to 88.6% F1 in the competition.

CLJun 4, 2020
The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain

Annemarie Friedrich, Heike Adel, Federico Tomazic et al.

This paper presents a new challenging information extraction task in the domain of materials science. We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications, such as involved materials and measurement conditions. With this paper, we publish our annotation guidelines, as well as our SOFC-Exp corpus consisting of 45 open-access scholarly articles annotated by domain experts. A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition and slot filling tasks as well as high annotation quality. We also present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set. On all tasks, using BERT embeddings leads to large performance gains, but with increasing task complexity, adding a recurrent neural network on top seems beneficial. Our models will serve as competitive baselines in future work, and analysis of their performance highlights difficult cases when modeling the data and suggests promising research directions.

CLMay 19, 2020
Closing the Gap: Joint De-Identification and Concept Extraction in the Clinical Domain

Lukas Lange, Heike Adel, Jannik Strötgen

Exploiting natural language processing in the clinical domain requires de-identification, i.e., anonymization of personal information in texts. However, current research considers de-identification and downstream tasks, such as concept extraction, only in isolation and does not study the effects of de-identification on other tasks. In this paper, we close this gap by reporting concept extraction performance on automatically anonymized data and investigating joint models for de-identification and concept extraction. In particular, we propose a stacked model with restricted access to privacy-sensitive information and a multitask model. We set the new state of the art on benchmark datasets in English (96.1% F1 for de-identification and 88.9% F1 for concept extraction) and Spanish (91.4% F1 for concept extraction).

CLMay 19, 2020
Adversarial Alignment of Multilingual Models for Extracting Temporal Expressions from Text

Lukas Lange, Anastasiia Iurshina, Heike Adel et al.

Although temporal tagging is still dominated by rule-based systems, there have been recent attempts at neural temporal taggers. However, all of them focus on monolingual settings. In this paper, we explore multilingual methods for the extraction of temporal expressions from text and investigate adversarial training for aligning embedding spaces to one common space. With this, we create a single multilingual model that can also be transferred to unseen languages and set the new state of the art in those cross-lingual transfer experiments.

CLMay 19, 2020
On the Choice of Auxiliary Languages for Improved Sequence Tagging

Lukas Lange, Heike Adel, Jannik Strötgen

Recent work showed that embeddings from related languages can improve the performance of sequence tagging, even for monolingual models. In this analysis paper, we investigate whether the best auxiliary language can be predicted based on language distances and show that the most related language is not always the best auxiliary language. Further, we show that attention-based meta-embeddings can effectively combine pre-trained embeddings from different languages for sequence tagging and set new state-of-the-art results for part-of-speech tagging in five languages.

CLOct 1, 2019
Type-aware Convolutional Neural Networks for Slot Filling

Heike Adel, Hinrich Schütze

The slot filling task aims at extracting answers for queries about entities from text, such as "Who founded Apple". In this paper, we focus on the relation classification component of a slot filling system. We propose type-aware convolutional neural networks to benefit from the mutual dependencies between entity and relation classification. In particular, we explore different ways of integrating the named entity types of the relation arguments into a neural network for relation classification, including a joint training and a structured prediction approach. To the best of our knowledge, this is the first study on type-aware neural networks for slot filling. The type-aware models lead to the best results of our slot filling pipeline. Joint training performs comparable to structured prediction. To understand the impact of the different components of the slot filling pipeline, we perform a recall analysis, a manual error analysis and several ablation studies. Such analyses are of particular importance to other slot filling researchers since the official slot filling evaluations only assess pipeline outputs. The analyses show that especially coreference resolution and our convolutional neural networks have a large positive impact on the final performance of the slot filling pipeline. The presented models, the source code of our system as well as our coreference resource is publicy available.

CLFeb 28, 2019
Adversarial Training for Satire Detection: Controlling for Confounding Variables

Robert McHardy, Heike Adel, Roman Klinger

The automatic detection of satire vs. regular news is relevant for downstream applications (for instance, knowledge base population) and to improve the understanding of linguistic characteristics of satire. Recent approaches build upon corpora which have been labeled automatically based on article sources. We hypothesize that this encourages the models to learn characteristics for different publication sources (e.g., "The Onion" vs. "The Guardian") rather than characteristics of satire, leading to poor generalization performance to unseen publication sources. We therefore propose a novel model for satire detection with an adversarial component to control for the confounding variable of publication source. On a large novel data set collected from German news (which we make available to the research community), we observe comparable satire classification performance and, as desired, a considerable drop in publication classification performance with adversarial training. Our analysis shows that the adversarial component is crucial for the model to learn to pay attention to linguistic properties of satire.

CLNov 6, 2018
CIS at TAC Cold Start 2015: Neural Networks and Coreference Resolution for Slot Filling

Heike Adel, Hinrich Schütze

This paper describes the CIS slot filling system for the TAC Cold Start evaluations 2015. It extends and improves the system we have built for the evaluation last year. This paper mainly describes the changes to our last year's system. Especially, it focuses on the coreference and classification component. For coreference, we have performed several analysis and prepared a resource to simplify our end-to-end system and improve its runtime. For classification, we propose to use neural networks. We have trained convolutional and recurrent neural networks and combined them with traditional evaluation methods, namely patterns and support vector machines. Our runs for the 2015 evaluation have been designed to directly assess the effect of each network on the end-to-end performance of the system. The CIS system achieved rank 3 of all slot filling systems participating in the task.

CLAug 14, 2018
Adversarial Neural Networks for Cross-lingual Sequence Tagging

Heike Adel, Anton Bryl, David Weiss et al.

We study cross-lingual sequence tagging with little or no labeled data in the target language. Adversarial training has previously been shown to be effective for training cross-lingual sentence classifiers. However, it is not clear if language-agnostic representations enforced by an adversarial language discriminator will also enable effective transfer for token-level prediction tasks. Therefore, we experiment with different types of adversarial training on two tasks: dependency parsing and sentence compression. We show that adversarial training consistently leads to improved cross-lingual performance on each task compared to a conventionally trained baseline.

CLAug 13, 2018
Neural Semi-Markov Conditional Random Fields for Robust Character-Based Part-of-Speech Tagging

Apostolos Kemos, Heike Adel, Hinrich Schütze

Character-level models of tokens have been shown to be effective at dealing with within-token noise and out-of-vocabulary words. But these models still rely on correct token boundaries. In this paper, we propose a novel end-to-end character-level model and demonstrate its effectiveness in multilingual settings and when token boundaries are noisy. Our model is a semi-Markov conditional random field with neural networks for character and segment representation. It requires no tokenizer. The model matches state-of-the-art baselines for various languages and significantly outperforms them on a noisy English version of a part-of-speech tagging benchmark dataset. Our code and the noisy dataset are publicly available at http://cistern.cis.lmu.de/semiCRF.

CLOct 26, 2017
Impact of Coreference Resolution on Slot Filling

Heike Adel, Hinrich Schütze

In this paper, we demonstrate the importance of coreference resolution for natural language processing on the example of the TAC Slot Filling shared task. We illustrate the strengths and weaknesses of automatic coreference resolution systems and provide experimental results to show that they improve performance in the slot filling end-to-end setting. Finally, we publish KBPchains, a resource containing automatically extracted coreference chains from the TAC source corpus in order to support other researchers working on this topic.

CLOct 4, 2017
Syntactic and Semantic Features For Code-Switching Factored Language Models

Heike Adel, Ngoc Thang Vu, Katrin Kirchhoff et al.

This paper presents our latest investigations on different features for factored language models for Code-Switching speech and their effect on automatic speech recognition (ASR) performance. We focus on syntactic and semantic features which can be extracted from Code-Switching text data and integrate them into factored language models. Different possible factors, such as words, part-of-speech tags, Brown word clusters, open class words and clusters of open class word embeddings are explored. The experimental results reveal that Brown word clusters, part-of-speech tags and open-class words are the most effective at reducing the perplexity of factored language models on the Mandarin-English Code-Switching corpus SEAME. In ASR experiments, the model containing Brown word clusters and part-of-speech tags and the model also including clusters of open class word embeddings yield the best mixed error rate results. In summary, the best language model can significantly reduce the perplexity on the SEAME evaluation set by up to 10.8% relative and the mixed error rate by up to 3.4% relative.

CLAug 7, 2017
Corpus-level Fine-grained Entity Typing

Yadollah Yaghoobzadeh, Heike Adel, Hinrich Schütze

This paper addresses the problem of corpus-level entity typing, i.e., inferring from a large corpus that an entity is a member of a class such as "food" or "artist". The application of entity typing we are interested in is knowledge base completion, specifically, to learn which classes an entity is a member of. We propose FIGMENT to tackle this problem. FIGMENT is embedding- based and combines (i) a global model that scores based on aggregated contextual information of an entity and (ii) a context model that first scores the individual occurrences of an entity and then aggregates the scores. Each of the two proposed models has some specific properties. For the global model, learning high quality entity representations is crucial because it is the only source used for the predictions. Therefore, we introduce representations using name and contexts of entities on the three levels of entity, word, and character. We show each has complementary information and a multi-level representation is the best. For the context model, we need to use distant supervision since the context-level labels are not available for entities. Distant supervised labels are noisy and this harms the performance of models. Therefore, we introduce and apply new algorithms for noise mitigation using multi-instance learning. We show the effectiveness of our models in a large entity typing dataset, built from Freebase.

CLJul 24, 2017
Global Normalization of Convolutional Neural Networks for Joint Entity and Relation Classification

Heike Adel, Hinrich Schütze

We introduce globally normalized convolutional neural networks for joint entity classification and relation extraction. In particular, we propose a way to utilize a linear-chain conditional random field output layer for predicting entity types and relations between entities at the same time. Our experiments show that global normalization outperforms a locally normalized softmax layer on a benchmark dataset.

CLDec 22, 2016
Noise Mitigation for Neural Entity Typing and Relation Extraction

Yadollah Yaghoobzadeh, Heike Adel, Hinrich Schütze

In this paper, we address two different types of noise in information extraction models: noise from distant supervision and noise from pipeline input features. Our target tasks are entity typing and relation extraction. For the first noise type, we introduce multi-instance multi-label learning algorithms using neural network models, and apply them to fine-grained entity typing for the first time. This gives our models comparable performance with the state-of-the-art supervised approach which uses global embeddings of entities. For the second noise type, we propose ways to improve the integration of noisy entity type predictions into relation extraction. Our experiments show that probabilistic predictions are more robust than discrete predictions and that joint training of the two tasks performs best.

CLDec 20, 2016
Exploring Different Dimensions of Attention for Uncertainty Detection

Heike Adel, Hinrich Schütze

Neural networks with attention have proven effective for many natural language processing tasks. In this paper, we develop attention mechanisms for uncertainty detection. In particular, we generalize standardly used attention mechanisms by introducing external attention and sequence-preserving attention. These novel architectures differ from standard approaches in that they use external resources to compute attention weights and preserve sequence information. We compare them to other configurations along different dimensions of attention. Our novel architectures set the new state of the art on a Wikipedia benchmark dataset and perform similar to the state-of-the-art model on a biomedical benchmark which uses a large set of linguistic features.

CLOct 3, 2016
Nonsymbolic Text Representation

Hinrich Schuetze, Heike Adel, Ehsaneddin Asgari

We introduce the first generic text representation model that is completely nonsymbolic, i.e., it does not require the availability of a segmentation or tokenization method that attempts to identify words or other symbolic units in text. This applies to training the parameters of the model on a training corpus as well as to applying it when computing the representation of a new text. We show that our model performs better than prior work on an information extraction and a text denoising task.

CLMay 24, 2016
Combining Recurrent and Convolutional Neural Networks for Relation Classification

Ngoc Thang Vu, Heike Adel, Pankaj Gupta et al.

This paper investigates two different neural architectures for the task of relation classification: convolutional neural networks and recurrent neural networks. For both models, we demonstrate the effect of different architectural choices. We present a new context representation for convolutional neural networks for relation classification (extended middle context). Furthermore, we propose connectionist bi-directional recurrent neural networks and introduce ranking loss for their optimization. Finally, we show that combining convolutional and recurrent neural networks using a simple voting scheme is accurate enough to improve results. Our neural models achieve state-of-the-art results on the SemEval 2010 relation classification task.

CLMar 16, 2016
Comparing Convolutional Neural Networks to Traditional Models for Slot Filling

Heike Adel, Benjamin Roth, Hinrich Schütze

We address relation classification in the context of slot filling, the task of finding and evaluating fillers like "Steve Jobs" for the slot X in "X founded Apple". We propose a convolutional neural network which splits the input sentence into three parts according to the relation arguments and compare it to state-of-the-art and traditional approaches of relation classification. Finally, we combine different methods and show that the combination is better than individual approaches. We also analyze the effect of genre differences on performance.