Sagnik Ray Choudhury

CL
h-index42
20papers
2,737citations
Novelty40%
AI Score53

20 Papers

CLSep 15, 2022
Machine Reading, Fast and Slow: When Do Models "Understand" Language?

Sagnik Ray Choudhury, Anna Rogers, Isabelle Augenstein

Two of the most fundamental challenges in Natural Language Understanding (NLU) at present are: (a) how to establish whether deep learning-based models score highly on NLU benchmarks for the 'right' reasons; and (b) to understand what those reasons would even be. We investigate the behavior of reading comprehension models with respect to two linguistic 'skills': coreference resolution and comparison. We propose a definition for the reasoning steps expected from a system that would be 'reading slowly', and compare that with the behavior of five models of the BERT family of various sizes, observed through saliency scores and counterfactual explanations. We find that for comparison (but not coreference) the systems based on larger encoders are more likely to rely on the 'right' information, but even they struggle with generalization, suggesting that they still learn specific lexical patterns rather than the general principles of comparison.

CLMar 10Code
S-GRADES -- Studying Generalization of Student Response Assessments in Diverse Evaluative Settings

Tasfia Seuti, Sagnik Ray Choudhury

Evaluating student responses, from long essays to short factual answers, is a key challenge in educational NLP. Automated Essay Scoring (AES) focuses on holistic writing qualities such as coherence and argumentation, while Automatic Short Answer Grading (ASAG) emphasizes factual correctness and conceptual understanding. Despite their shared goal, these paradigms have progressed in isolation with fragmented datasets, inconsistent metrics, and separate communities. We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. The benchmark is fully open-source and designed for extensibility, enabling continuous integration of new datasets and evaluation settings. To demonstrate the utility of S-GRADES, we evaluate three state-of-the-art large language models across the benchmark using multiple reasoning strategies in prompting. We further examine the effects of exemplar selection and cross-dataset exemplar transfer. Our analyses illustrate how benchmark-driven evaluation reveals reliability and generalization gaps across essay and short-answer grading tasks, highlighting the importance of standardized, cross-paradigm assessment.

CLOct 20, 2023
Explaining Interactions Between Text Spans

Sagnik Ray Choudhury, Pepa Atanasova, Isabelle Augenstein

Reasoning over spans of tokens from different parts of the input is essential for natural language understanding (NLU) tasks such as fact-checking (FC), machine reading comprehension (MRC) or natural language inference (NLI). However, existing highlight-based explanations primarily focus on identifying individual important tokens or interactions only between adjacent tokens or tuples of tokens. Most notably, there is a lack of annotations capturing the human decision-making process w.r.t. the necessary interactions for informed decision-making in such tasks. To bridge this gap, we introduce SpanEx, a multi-annotator dataset of human span interaction explanations for two NLU tasks: NLI and FC. We then investigate the decision-making processes of multiple fine-tuned large language models in terms of the employed connections between spans in separate parts of the input and compare them to the human reasoning processes. Finally, we present a novel community detection based unsupervised method to extract such interaction explanations from a model's inner workings.

CLOct 20, 2023
Implications of Annotation Artifacts in Edge Probing Test Datasets

Sagnik Ray Choudhury, Jushaan Kalra

Edge probing tests are classification tasks that test for grammatical knowledge encoded in token representations coming from contextual encoders such as large language models (LLMs). Many LLM encoders have shown high performance in EP tests, leading to conjectures about their ability to encode linguistic knowledge. However, a large body of research claims that the tests necessarily do not measure the LLM's capacity to encode knowledge, but rather reflect the classifiers' ability to learn the problem. Much of this criticism stems from the fact that often the classifiers have very similar accuracy when an LLM vs a random encoder is used. Consequently, several modifications to the tests have been suggested, including information theoretic probes. We show that commonly used edge probing test datasets have various biases including memorization. When these biases are removed, the LLM encoders do show a significant difference from the random ones, even with the simple non-information theoretic probes.

CYMay 14, 2025Code
Towards Automated Situation Awareness: A RAG-Based Framework for Peacebuilding Reports

Poli A. Nemkova, Suleyman O. Polat, Rafid I. Jahan et al.

Timely and accurate situation awareness is vital for decision-making in humanitarian response, conflict monitoring, and early warning and early action. However, the manual analysis of vast and heterogeneous data sources often results in delays, limiting the effectiveness of interventions. This paper introduces a dynamic Retrieval-Augmented Generation (RAG) system that autonomously generates situation awareness reports by integrating real-time data from diverse sources, including news articles, conflict event databases, and economic indicators. Our system constructs query-specific knowledge bases on demand, ensuring timely, relevant, and accurate insights. To ensure the quality of generated reports, we propose a three-level evaluation framework that combines semantic similarity metrics, factual consistency checks, and expert feedback. The first level employs automated NLP metrics to assess coherence and factual accuracy. The second level involves human expert evaluation to verify the relevance and completeness of the reports. The third level utilizes LLM-as-a-Judge, where large language models provide an additional layer of assessment to ensure robustness. The system is tested across multiple real-world scenarios, demonstrating its effectiveness in producing coherent, insightful, and actionable reports. By automating report generation, our approach reduces the burden on human analysts and accelerates decision-making processes. To promote reproducibility and further research, we openly share our code and evaluation tools with the community via GitHub.

CLOct 9, 2020Code
Constrained Decoding for Computationally Efficient Named Entity Recognition Taggers

Brian Lester, Daniel Pressel, Amy Hemmeter et al.

Current state-of-the-art models for named entity recognition (NER) are neural models with a conditional random field (CRF) as the final layer. Entities are represented as per-token labels with a special structure in order to decode them into spans. Current work eschews prior knowledge of how the span encoding scheme works and relies on the CRF learning which transitions are illegal and which are not to facilitate global coherence. We find that by constraining the output to suppress illegal transitions we can train a tagger with a cross-entropy loss twice as fast as a CRF with differences in F1 that are statistically insignificant, effectively eliminating the need for a CRF. We analyze the dynamics of tag co-occurrence to explain when these constraints are most effective and provide open source implementations of our tagger in both PyTorch and TensorFlow.

CLSep 30, 2020Code
Multiple Word Embeddings for Increased Diversity of Representation

Brian Lester, Daniel Pressel, Amy Hemmeter et al.

Most state-of-the-art models in natural language processing (NLP) are neural models built on top of large, pre-trained, contextual language models that generate representations of words in context and are fine-tuned for the task at hand. The improvements afforded by these "contextual embeddings" come with a high computational cost. In this work, we explore a simple technique that substantially and consistently improves performance over a strong baseline with negligible increase in run time. We concatenate multiple pre-trained embeddings to strengthen our representation of words. We show that this concatenation technique works across many tasks, datasets, and model types. We analyze aspects of pre-trained embedding similarity and vocabulary coverage and find that the representational diversity between different pre-trained embeddings is the driving force of why this technique works. We provide open source implementations of our models in both TensorFlow and PyTorch.

CLJan 5, 2020Code
Computationally Efficient NER Taggers with Combined Embeddings and Constrained Decoding

Brian Lester, Daniel Pressel, Amy Hemmeter et al.

Current State-of-the-Art models in Named Entity Recognition (NER) are neural models with a Conditional Random Field (CRF) as the final network layer, and pre-trained "contextual embeddings". The CRF layer is used to facilitate global coherence between labels, and the contextual embeddings provide a better representation of words in context. However, both of these improvements come at a high computational cost. In this work, we explore two simple techniques that substantially improve NER performance over a strong baseline with negligible cost. First, we use multiple pre-trained embeddings as word representations via concatenation. Second, we constrain the tagger, trained using a cross-entropy loss, during decoding to eliminate illegal transitions. While training a tagger on CoNLL 2003 we find a $786$\% speed-up over a contextual embeddings-based tagger without sacrificing strong performance. We also show that the concatenation technique works across multiple tasks and datasets. We analyze aspects of similarity and coverage between pre-trained embeddings and the dynamics of tag co-occurrence to explain why these techniques work. We provide an open source implementation of our tagger using these techniques in three popular deep learning frameworks --- TensorFlow, Pytorch, and DyNet.

IRJan 4, 2018Code
Text Extraction and Retrieval from Smartphone Screenshots: Building a Repository for Life in Media

Agnese Chiatti, Mu Jung Cho, Anupriya Gagneja et al.

Daily engagement in life experiences is increasingly interwoven with mobile device use. Screen capture at the scale of seconds is being used in behavioral studies and to implement "just-in-time" health interventions. The increasing psychological breadth of digital information will continue to make the actual screens that people view a preferred if not required source of data about life experiences. Effective and efficient Information Extraction and Retrieval from digital screenshots is a crucial prerequisite to successful use of screen data. In this paper, we present the experimental workflow we exploited to: (i) pre-process a unique collection of screen captures, (ii) extract unstructured text embedded in the images, (iii) organize image text and metadata based on a structured schema, (iv) index the resulting document collection, and (v) allow for Image Retrieval through a dedicated vertical search engine application. The adopted procedure integrates different open source libraries for traditional image processing, Optical Character Recognition (OCR), and Image Retrieval. Our aim is to assess whether and how state-of-the-art methodologies can be applied to this novel data set. We show how combining OpenCV-based pre-processing modules with a Long short-term memory (LSTM) based release of Tesseract OCR, without ad hoc training, led to a 74% character-level accuracy of the extracted text. Further, we used the processed repository as baseline for a dedicated Image Retrieval system, for the immediate use and application for behavioral and prevention scientists. We discuss issues of Text Information Extraction and Retrieval that are particular to the screenshot image case and suggest important future work.

CLMay 28, 2025
NLP for Social Good: A Survey of Challenges, Opportunities, and Responsible Deployment

Antonia Karamolegkou, Angana Borah, Eunjung Cho et al.

Recent advancements in large language models (LLMs) have unlocked unprecedented possibilities across a range of applications. However, as a community, we believe that the field of Natural Language Processing (NLP) has a growing need to approach deployment with greater intentionality and responsibility. In alignment with the broader vision of AI for Social Good (Tomašev et al., 2020), this paper examines the role of NLP in addressing pressing societal challenges. Through a cross-disciplinary analysis of social goals and emerging risks, we highlight promising research directions and outline challenges that must be addressed to ensure responsible and equitable progress in NLP4SG research.

CLSep 15, 2025
ClaimIQ at CheckThat! 2025: Comparing Prompted and Fine-Tuned Language Models for Verifying Numerical Claims

Anirban Saha Anik, Md Fahimul Kabir Chowdhury, Andrew Wyckoff et al.

This paper presents our system for Task 3 of the CLEF 2025 CheckThat! Lab, which focuses on verifying numerical and temporal claims using retrieved evidence. We explore two complementary approaches: zero-shot prompting with instruction-tuned large language models (LLMs) and supervised fine-tuning using parameter-efficient LoRA. To enhance evidence quality, we investigate several selection strategies, including full-document input and top-k sentence filtering using BM25 and MiniLM. Our best-performing model LLaMA fine-tuned with LoRA achieves strong performance on the English validation set. However, a notable drop in the test set highlights a generalization challenge. These findings underscore the importance of evidence granularity and model adaptation for robust numerical fact verification.

CLMay 14, 2025
Do Large Language Models Know Conflict? Investigating Parametric vs. Non-Parametric Knowledge of LLMs for Conflict Forecasting

Apollinaire Poli Nemkova, Sarath Chandra Lingareddy, Sagnik Ray Choudhury et al.

Large Language Models (LLMs) have shown impressive performance across natural language tasks, but their ability to forecast violent conflict remains underexplored. We investigate whether LLMs possess meaningful parametric knowledge-encoded in their pretrained weights-to predict conflict escalation and fatalities without external data. This is critical for early warning systems, humanitarian planning, and policy-making. We compare this parametric knowledge with non-parametric capabilities, where LLMs access structured and unstructured context from conflict datasets (e.g., ACLED, GDELT) and recent news reports via Retrieval-Augmented Generation (RAG). Incorporating external information could enhance model performance by providing up-to-date context otherwise missing from pretrained weights. Our two-part evaluation framework spans 2020-2024 across conflict-prone regions in the Horn of Africa and the Middle East. In the parametric setting, LLMs predict conflict trends and fatalities relying only on pretrained knowledge. In the non-parametric setting, models receive summaries of recent conflict events, indicators, and geopolitical developments. We compare predicted conflict trend labels (e.g., Escalate, Stable Conflict, De-escalate, Peace) and fatalities against historical data. Our findings highlight the strengths and limitations of LLMs for conflict forecasting and the benefits of augmenting them with structured external knowledge.

DLMay 22, 2025
BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text

Ibrahim Al Azher, Miftahul Jannat Mokarrama, Zhishuai Guo et al.

In scientific research, ``limitations'' refer to the shortcomings, constraints, or weaknesses of a study. A transparent reporting of such limitations can enhance the quality and reproducibility of research and improve public trust in science. However, authors often underreport limitations in their papers and rely on hedging strategies to meet editorial requirements at the expense of readers' clarity and confidence. This tendency, combined with the surge in scientific publications, has created a pressing need for automated approaches to extract and generate limitations from scholarly papers. To address this need, we present a full architecture for computational analysis of research limitations. Specifically, we (1) create a dataset of limitations from ACL, NeurIPS, and PeerJ papers by extracting them from the text and supplementing them with external reviews; (2) we propose methods to automatically generate limitations using a novel Retrieval Augmented Generation (RAG) technique; (3) we design a fine-grained evaluation framework for generated limitations, along with a meta-evaluation of these techniques.

CLOct 3, 2025
Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models

Jingyi Sun, Pepa Atanasova, Sagnik Ray Choudhury et al.

Context utilisation, the ability of Language Models (LMs) to incorporate relevant information from the provided context when generating responses, remains largely opaque to users, who cannot determine whether models draw from parametric memory or provided context, nor identify which specific context pieces inform the response. Highlight explanations (HEs) offer a natural solution as they can point the exact context pieces and tokens that influenced model outputs. However, no existing work evaluates their effectiveness in accurately explaining context utilisation. We address this gap by introducing the first gold standard HE evaluation framework for context attribution, using controlled test cases with known ground-truth context usage, which avoids the limitations of existing indirect proxy evaluations. To demonstrate the framework's broad applicability, we evaluate four HE methods -- three established techniques and MechLight, a mechanistic interpretability approach we adapt for this task -- across four context scenarios, four datasets, and five LMs. Overall, we find that MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, pointing to fundamental challenges in explanation accuracy that require new approaches to deliver reliable context utilisation explanations at scale.

CYJun 24, 2025
A Detailed Factor Analysis for the Political Compass Test: Navigating Ideologies of Large Language Models

Sadia Kamal, Lalu Prasad Yadav Prakash, S M Rafiuddin et al.

The Political Compass Test (PCT) and similar surveys are commonly used to assess political bias in auto-regressive LLMs. Our rigorous statistical experiments show that while changes to standard generation parameters have minimal effect on PCT scores, prompt phrasing and fine-tuning individually and together can significantly influence results. Interestingly, fine-tuning on politically rich vs. neutral datasets does not lead to different shifts in scores. We also generalize these findings to a similar popular test called 8 Values. Humans do not change their responses to questions when prompted differently (``answer this question'' vs ``state your opinion''), or after exposure to politically neutral text, such as mathematical formulae. But the fact that the models do so raises concerns about the validity of these tests for measuring model bias, and paves the way for deeper exploration into how political and social views are encoded in LLMs.

CLMar 20, 2025
FutureGen: A RAG-based Approach to Generate the Future Work of Scientific Article

Ibrahim Al Azher, Miftahul Jannat Mokarrama, Zhishuai Guo et al.

The Future Work section of a scientific article outlines potential research directions by identifying gaps and limitations of a current study. This section serves as a valuable resource for early-career researchers seeking unexplored areas and experienced researchers looking for new projects or collaborations. In this study, we generate future work suggestions from a scientific article. To enrich the generation process with broader insights and reduce the chance of missing important research directions, we use context from related papers using RAG. We experimented with various Large Language Models (LLMs) integrated into Retrieval-Augmented Generation (RAG). We incorporate an LLM feedback mechanism to enhance the quality of the generated content and introduce an LLM-as-a-judge framework for robust evaluation, assessing key aspects such as novelty, hallucination, and feasibility. Our results demonstrate that the RAG-based approach using GPT-4o mini, combined with an LLM feedback mechanism, outperforms other methods based on both qualitative and quantitative evaluations. Moreover, we conduct a human evaluation to assess the LLM as an extractor, generator, and feedback provider.

CLSep 15, 2021
Can Edge Probing Tasks Reveal Linguistic Knowledge in QA Models?

Sagnik Ray Choudhury, Nikita Bhutani, Isabelle Augenstein

There have been many efforts to try to understand what grammatical knowledge (e.g., ability to understand the part of speech of a token) is encoded in large pre-trained language models (LM). This is done through `Edge Probing' (EP) tests: supervised classification tasks to predict the grammatical properties of a span (whether it has a particular part of speech) using only the token representations coming from the LM encoder. However, most NLP applications fine-tune these LM encoders for specific tasks. Here, we ask: if an LM is fine-tuned, does the encoding of linguistic information in it change, as measured by EP tests? Specifically, we focus on the task of Question Answering (QA) and conduct experiments on multiple datasets. We find that EP test results do not change significantly when the fine-tuned model performs well or in adversarial situations where the model is forced to learn wrong correlations. From a similar finding, some recent papers conclude that fine-tuning does not change linguistic knowledge in encoders but they do not provide an explanation. We find that EP models themselves are susceptible to exploiting spurious correlations in the EP datasets. When this dataset bias is corrected, we do see an improvement in the EP test results as expected.

CLApr 18, 2021
Intent Features for Rich Natural Language Understanding

Brian Lester, Sagnik Ray Choudhury, Rashmi Prasad et al.

Complex natural language understanding modules in dialog systems have a richer understanding of user utterances, and thus are critical in providing a better user experience. However, these models are often created from scratch, for specific clients and use cases, and require the annotation of large datasets. This encourages the sharing of annotated data across multiple clients. To facilitate this we introduce the idea of intent features: domain and topic agnostic properties of intents that can be learned from the syntactic cues only, and hence can be shared. We introduce a new neural network architecture, the Global-Local model, that shows significant improvement over strong baselines for identifying these features in a deployed, multi-intent natural language understanding module, and, more generally, in a classification setting where a part of an utterance has to be classified utilizing the whole context.

CLApr 15, 2021
Quantifying Gender Bias Towards Politicians in Cross-Lingual Language Models

Karolina Stańczak, Sagnik Ray Choudhury, Tiago Pimentel et al.

Recent research has demonstrated that large pre-trained language models reflect societal biases expressed in natural language. The present paper introduces a simple method for probing language models to conduct a multilingual study of gender bias towards politicians. We quantify the usage of adjectives and verbs generated by language models surrounding the names of politicians as a function of their gender. To this end, we curate a dataset of 250k politicians worldwide, including their names and gender. Our study is conducted in seven languages across six different language modeling architectures. The results demonstrate that pre-trained language models' stance towards politicians varies strongly across analyzed languages. We find that while some words such as dead, and designated are associated with both male and female politicians, a few specific words such as beautiful and divorced are predominantly associated with female politicians. Finally, and contrary to previous findings, our study suggests that larger language models do not tend to be significantly more gender-biased than smaller ones.

CLJan 25, 2014
Keyword and Keyphrase Extraction Using Centrality Measures on Collocation Networks

Shibamouli Lahiri, Sagnik Ray Choudhury, Cornelia Caragea

Keyword and keyphrase extraction is an important problem in natural language processing, with applications ranging from summarization to semantic search to document clustering. Graph-based approaches to keyword and keyphrase extraction avoid the problem of acquiring a large in-domain training corpus by applying variants of PageRank algorithm on a network of words. Although graph-based approaches are knowledge-lean and easily adoptable in online systems, it remains largely open whether they can benefit from centrality measures other than PageRank. In this paper, we experiment with an array of centrality measures on word and noun phrase collocation networks, and analyze their performance on four benchmark datasets. Not only are there centrality measures that perform as well as or better than PageRank, but they are much simpler (e.g., degree, strength, and neighborhood size). Furthermore, centrality-based methods give results that are competitive with and, in some cases, better than two strong unsupervised baselines.