Sarvnaz Karimi

h-index: 26

18 papers

6,794 citations

Novelty: 26%

AI Score: 40

Ranked #73,659 of 194,257 authors (top 38%); #14,144 in CL (top 46%)

18 Papers

2.7 | CL | Dec 1, 2025

CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering

Liangji Kong, Aditya Joshi, Sarvnaz Karimi

Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (for example, scientific literature from the Elsevier website) or in structured data (heterogeneous climate data via government APIs). We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS), a framework that enables experts (farmer advisors) to obtain credible preliminary answers from complex evidence sources from the web. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation via a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts. Together, these components enable readable, verifiable, and domain-grounded question-answering without fine-tuning or reinforcement learning. Using a previously reported dataset of expert-curated question-answers, we show that CAIRNS outperforms the baselines on most of the metrics. Our thorough ablation study confirms the results on all metrics. To validate our LLM-based evaluation, we also report an analysis of correlations against human judgment.

23.9 | CL | Nov 24, 2022

Detecting Entities in the Astrophysics Literature: A Comparison of Word-based and Span-based Entity Recognition Methods

Xiang Dai, Sarvnaz Karimi

Information Extraction from scientific literature can be challenging due to the highly specialised nature of such text. We describe our entity recognition methods developed as part of the DEAL (Detecting Entities in the Astrophysics Literature) shared task. The aim of the task is to build a system that can identify Named Entities in a dataset composed of scholarly articles from astrophysics literature. We planned our participation such that it enables us to conduct an empirical comparison between word-based tagging and span-based classification methods. When evaluated on two hidden test sets provided by the organizer, our best-performing submission achieved $F_1$ scores of 0.8307 (validation phase) and 0.7990 (testing phase).
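The contrast between word-based tagging and span-based classification can be illustrated with a minimal sketch. It assumes a BIO tagging scheme for the word-based view and a fixed maximum span width for the span-based view; the entity labels (`Star`, `Mission`) and function names are hypothetical and are not taken from the shared task or the authors' systems.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Word-based view: one BIO tag per token, decoded into labelled spans."""
    spans: List[Span] = []
    start, label = None, None
    # A trailing "O" sentinel closes an entity that ends at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.append((start, i, label))
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]
            else:
                start, label = None, None
    return spans


def enumerate_spans(n_tokens: int, max_width: int) -> List[Tuple[int, int]]:
    """Span-based view: list every candidate span; a classifier would then
    assign each candidate an entity label or 'no entity'."""
    return [
        (i, j)
        for i in range(n_tokens)
        for j in range(i + 1, min(i + max_width, n_tokens) + 1)
    ]


# Hypothetical tags for a four-token sentence.
tags = ["B-Star", "I-Star", "O", "B-Mission"]
entity_spans = bio_to_spans(tags)
candidates = enumerate_spans(len(tags), max_width=2)
```

In the word-based view each token receives exactly one tag, whereas the span-based view scores every candidate span independently, so it can in principle label overlapping spans at the cost of many more candidates.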

13.8 | CL | Sep 29, 2024

A Critical Look at Meta-evaluating Summarisation Evaluation Metrics

Xiang Dai, Sarvnaz Karimi, Biaoyan Fang

Effective summarisation evaluation metrics enable researchers and practitioners to compare different summarisation systems efficiently. Estimating the effectiveness of an automatic evaluation metric, termed meta-evaluation, is a critically important research question. In this position paper, we review recent meta-evaluation practices for summarisation evaluation metrics and find that (1) evaluation metrics are primarily meta-evaluated on datasets consisting of examples from news summarisation datasets, and (2) there has been a noticeable shift in research focus towards evaluating the faithfulness of generated summaries. We argue that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics and analyze the generalization ability of existing evaluation metrics. In addition, we call for research focusing on user-centric quality dimensions that consider the generated summary's communicative goal and the role of summarisation in the workflow.

1.0 | CL | Mar 15, 2024

Identifying Health Risks from Family History: A Survey of Natural Language Processing Techniques

Xiang Dai, Sarvnaz Karimi, Nathan O'Callaghan

Electronic health records include information on patients' status and medical history, which could cover the history of diseases and disorders that could be hereditary. One important use of family history information is in precision health, where the goal is to keep the population healthy with preventative measures. Natural Language Processing (NLP) and machine learning techniques can help identify information that allows health professionals to recognise health risks before a condition develops in a patient's later years, saving lives and reducing healthcare costs. We survey the literature on the techniques from the NLP field that have been developed to utilise digital health records to identify risks of familial diseases. We highlight that rule-based methods are heavily investigated and are still actively used for family history extraction. Still, more recent efforts have been put into building neural models based on large-scale pre-trained language models. In addition to the areas where NLP has successfully been utilised, we also identify the areas where more research is needed to unlock the value of patients' records regarding data collection, task formulation and downstream applications.

6.7 | CL | Aug 2, 2025

CSIRO-LT at SemEval-2025 Task 11: Adapting LLMs for Emotion Recognition for Multiple Languages

Jiyu Chen, Necva Bölücü, Sarvnaz Karimi et al.

Detecting emotions across different languages is challenging due to the varied and culturally nuanced ways in which emotions are expressed. The Semeval 2025 Task 11: Bridging the Gap in Text-Based emotion shared task was organised to investigate emotion recognition across different languages. The goal of the task is to implement an emotion recogniser that can identify the basic emotional states that general third-party observers would attribute to an author based on their written text snippet, along with the intensity of those emotions. We report our investigation of various task-adaptation strategies for LLMs in emotion recognition. We show that the most effective method for this task is to fine-tune a pre-trained multilingual LLM with LoRA separately for each language.

1.0 | CL | Dec 16, 2024

Can AI Extract Antecedent Factors of Human Trust in AI? An Application of Information Extraction for Scientific Literature in Behavioural and Computer Sciences

Melanie McGrath, Harrison Bailey, Necva Bölücü et al.

Information extraction from the scientific literature is one of the main techniques to transform unstructured knowledge hidden in the text into structured data which can then be used for decision-making in downstream tasks. One such area is Trust in AI, where factors contributing to human trust in artificial intelligence applications are studied. The relationships of these factors with human trust in such applications are complex. We hence explore this space through the lens of information extraction where, with the input of domain experts, we carefully design annotation guidelines, create the first annotated English dataset in this domain, investigate an LLM-guided annotation, and benchmark it with state-of-the-art methods using large language models in named entity and relation extraction. Our results indicate that this problem requires supervised learning which may not be currently feasible with prompt-based LLMs.

2.3 | AI | Oct 22, 2024

AskBeacon -- Performing genomic data exchange and analytics with natural language

Anuradha Wickramarachchi, Shakila Tonni, Sonali Majumdar et al.

Enabling clinicians and researchers to directly interact with global genomic data resources by removing technological barriers is vital for medical genomics. AskBeacon enables Large Language Models to be applied to securely shared cohorts via the GA4GH Beacon protocol. By simply "asking" Beacon, actionable insights can be gained, analyzed and made publication-ready.

31.2 | CL | Oct 2, 2020

Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media

Xiang Dai, Sarvnaz Karimi, Ben Hachey et al.

Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.

4.3 | IR | Jul 6, 2020

Searching Scientific Literature for Answers on COVID-19 Questions

Vincent Nguyen, Maciek Rybinski, Sarvnaz Karimi et al.

Finding answers related to a pandemic of a novel disease raises new challenges for information seeking and retrieval, as the new information becomes available gradually. The TREC COVID search track aims to assist in creating search tools to aid scientists, clinicians, policy makers and others with similar information needs in finding reliable answers from the scientific literature. We experiment with different ranking algorithms as part of our participation in this challenge. We propose a novel method for neural retrieval, and demonstrate its effectiveness on the TREC COVID search.

31.3 | CL | Apr 28, 2020 | Code

An Effective Transition-based Model for Discontinuous NER

Xiang Dai, Sarvnaz Karimi, Ben Hachey et al.

Unlike widely used Named Entity Recognition (NER) data sets in generic domains, biomedical NER data sets often contain mentions consisting of discontinuous spans. Conventional sequence tagging techniques encode Markov assumptions that are efficient but preclude recovery of these mentions. We propose a simple, effective transition-based model with generic neural encoding for discontinuous NER. Through extensive experiments on three biomedical data sets, we show that our model can effectively recognize discontinuous mentions without sacrificing the accuracy on continuous mentions.
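Why discontinuous mentions are awkward for one-tag-per-token schemes can be seen in a small sketch. It only illustrates the data representation, not the transition-based model itself; the example phrase, the `Mention` type and the helper names are hypothetical.

```python
from typing import List, Tuple

# A mention is a list of (start, end_exclusive) token intervals.
Mention = List[Tuple[int, int]]


def is_discontinuous(mention: Mention) -> bool:
    """A mention is discontinuous when it has more than one interval."""
    return len(mention) > 1


def mention_text(tokens: List[str], mention: Mention) -> str:
    """Join the tokens covered by each interval, in order."""
    return " ".join(
        tok for start, end in mention for tok in tokens[start:end]
    )


# Hypothetical example: two mentions that share the token "muscle".
tokens = ["muscle", "pain", "and", "fatigue"]
muscle_pain: Mention = [(0, 2)]             # contiguous
muscle_fatigue: Mention = [(0, 1), (3, 4)]  # skips "pain and"

texts = [mention_text(tokens, m) for m in (muscle_pain, muscle_fatigue)]
```

A plain BIO sequence assigns each token a single tag, so it cannot mark "muscle" as belonging to both mentions or skip the tokens in between; an interval-based representation like the one above does not have that restriction.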

1.2 | SP | Dec 24, 2019

Comparison of the P300 detection accuracy related to the BCI speller and image recognition scenarios

S. A. Karimi, A. M. Mijani, M. T. Talebian et al.

Electroencephalography (EEG) recording scenarios use several protocols that produce various types of event-related potentials (ERPs). The P300 pattern is a well-known ERP produced by auditory and visual oddball paradigms and by the BCI speller system. In this study, the separability of P300 and non-P300 signals is investigated in two scenarios: an image recognition paradigm and a BCI speller. The image recognition scenario is an experiment that examines participants' knowledge of an image shown to them earlier by analyzing the EEG signal recorded while they observe that image as a visual stimulus. To do this, three well-known classifiers (SVM, Bayes LDA, and sparse logistic regression) were used to classify EEG recordings in a six-class problem. Filtered and down-sampled temporal samples of the EEG recordings were used as features for classifying the P300 pattern. Different sets of EEG recordings, with 4, 8 and 16 channels and different numbers of trials, were also used to compare a range of conditions. Accuracy increased with the number of trials and channels. The results show that accuracy is better in the image recognition scenario across the different channel sets and numbers of trials. It can therefore be concluded that the P300 pattern produced in the image recognition paradigm is more separable than that produced by the BCI (matrix speller).

31.0 | CL | Jun 13, 2019

A Comparison of Word-based and Context-based Representations for Classification Problems in Health Informatics

Aditya Joshi, Sarvnaz Karimi, Ross Sparks et al.

Distributed representations of text can be used as features when training a statistical classifier. These representations may be created as a composition of word vectors or as context-based sentence vectors. We compare the two kinds of representations (word versus context) for three classification problems: influenza infection classification, drug usage classification and personal health mention classification. For statistical classifiers trained for each of these problems, context-based representations based on ELMo, Universal Sentence Encoder, Neural-Net Language Model and FLAIR are better than Word2Vec, GloVe and the two adapted using the MESH ontology. There is an improvement of 2-4% in the accuracy when these context-based representations are used instead of word-based representations.

31.1 | CL | Jun 13, 2019

Figurative Usage Detection of Symptom Words to Improve Personal Health Mention Detection

Adith Iyer, Aditya Joshi, Sarvnaz Karimi et al.

Personal health mention detection deals with predicting whether or not a given sentence is a report of a health condition. Past work mentions errors in this prediction when symptom words, i.e. names of symptoms of interest, are used in a figurative sense. Therefore, we combine a state-of-the-art figurative usage detection with CNN-based personal health mention detection. To do so, we present two methods: a pipeline-based approach and a feature augmentation-based approach. The introduction of figurative usage detection results in an average improvement of 2.21% F-score of personal health mention detection, in the case of the feature augmentation-based approach. This paper demonstrates the promise of using figurative usage detection to improve personal health mention detection.

31.2 | CL | Jun 4, 2019

NNE: A Dataset for Nested Named Entity Recognition in English Newswire

Nicky Ringland, Xiang Dai, Ben Hachey et al.

Named entity recognition (NER) is widely used in natural language processing applications and downstream tasks. However, most NER tools target flat annotation from popular datasets, eschewing the semantic information available in nested entity mentions. We describe NNE, a fine-grained, nested named entity dataset over the full Wall Street Journal portion of the Penn Treebank (PTB). Our annotation comprises 279,795 mentions of 114 entity types with up to 6 layers of nesting. We hope the public release of this large dataset for English newswire will encourage development of new techniques for nested NER.

31.2 | CL | Apr 1, 2019 | Code

Using Similarity Measures to Select Pretraining Data for NER

Xiang Dai, Sarvnaz Karimi, Ben Hachey et al.

Word vectors and Language Models (LMs) pretrained on a large amount of unlabelled data can dramatically improve various Natural Language Processing (NLP) tasks. However, the measure and impact of similarity between pretraining data and target task data are left to intuition. We propose three cost-effective measures to quantify different aspects of similarity between source pretraining and target task data. We demonstrate that these measures are good predictors of the usefulness of pretrained models for Named Entity Recognition (NER) over 30 data pairs. Results also suggest that pretrained LMs are more effective and more predictable than pretrained word vectors, but pretrained word vectors are better when pretraining data is dissimilar.
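As an illustration of what a cheap similarity measure between pretraining and target data might look like, the sketch below computes the fraction of the target vocabulary that also occurs in the source data. This is a generic lexical-overlap measure chosen for illustration; the abstract does not specify the three proposed measures, so it should not be read as one of them. The function name and example sentences are hypothetical.

```python
from typing import Iterable


def vocab_coverage(source_tokens: Iterable[str],
                   target_tokens: Iterable[str]) -> float:
    """Fraction of distinct target word types that also appear in the source."""
    source_vocab = set(source_tokens)
    target_vocab = set(target_tokens)
    if not target_vocab:
        return 0.0  # assumption: define coverage of an empty target as 0
    return len(target_vocab & source_vocab) / len(target_vocab)


# Hypothetical toy corpora, whitespace-tokenised.
source = "the patient reported severe headache".split()
target = "patient reported mild headache today".split()

coverage = vocab_coverage(source, target)  # 3 of 5 target types are covered
```

A higher value suggests that the pretraining data shares more of the target task's vocabulary; a measure like this is cheap because it needs only token counts, not model training.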

1.3 | CL | Mar 14, 2019

Survey of Text-based Epidemic Intelligence: A Computational Linguistic Perspective

Aditya Joshi, Sarvnaz Karimi, Ross Sparks et al.

Epidemic intelligence deals with the detection of disease outbreaks using formal (such as hospital records) and informal sources (such as user-generated text on the web) of information. In this survey, we discuss approaches for epidemic intelligence that use textual datasets, referring to it as 'text-based epidemic intelligence'. We view past work in terms of two broad categories: health mention classification (selecting relevant text from a large volume) and health event detection (predicting epidemic events from a collection of relevant text). The focus of our discussion is the underlying computational linguistic techniques in the two categories. The survey also provides details of the state-of-the-art in annotation techniques, resources and evaluation strategies for epidemic intelligence.

3.2 | IR | Jan 29, 2018

Benchmarking Clinical Decision Support Search

Vincent Nguyen, Sarvnaz Karimi, Sara Falamaki et al.

Finding relevant literature underpins the practice of evidence-based medicine. From 2014 to 2016, TREC conducted a clinical decision support track, wherein participants were tasked with finding articles relevant to clinical questions posed by physicians. In total, 87 teams have participated over the past three years, generating 395 runs. During this period, each team has trialled a variety of methods. While there was significant overlap in the methods employed by different teams, the results were varied. Due to the diversity of the platforms used, the results arising from the different techniques are not directly comparable, reducing the ability to build on previous work. By using a stable platform, we have been able to compare different document and query processing techniques, allowing us to experiment with different search parameters. We have used our system to reproduce leading teams' runs, and compare the results obtained. By benchmarking our indexing and search techniques, we can statistically test a variety of hypotheses, paving the way for further research.

2.9 | AI | Apr 27, 2015

Concept Extraction to Identify Adverse Drug Reactions in Medical Forums: A Comparison of Algorithms

Alejandro Metke-Jimenez, Sarvnaz Karimi

Social media is becoming an increasingly important source of information to complement traditional pharmacovigilance methods. In order to identify signals of potential adverse drug reactions, it is necessary to first identify medical concepts in the social media text. Most of the existing studies use dictionary-based methods which are not evaluated independently from the overall signal detection task. We compare different approaches to automatically identify and normalise medical concepts in consumer reviews in medical forums. Specifically, we implement several dictionary-based methods popular in the relevant literature, as well as a method we suggest based on a state-of-the-art machine learning method for entity recognition. MetaMap, a popular biomedical concept extraction tool, is used as a baseline. Our evaluations were performed in a controlled setting on a common corpus which is a collection of medical forum posts annotated with concepts and linked to controlled vocabularies such as MedDRA and SNOMED CT. To our knowledge, our study is the first to systematically examine the effect of popular concept extraction methods in the area of signal detection for adverse reactions. We show that the choice of algorithm or controlled vocabulary has a significant impact on concept extraction, which will impact the overall signal detection process. We also show that our proposed machine learning approach significantly outperforms all the other methods in identification of both adverse reactions and drugs, even when trained with a relatively small set of annotated text.