Tarannum Shaila Zaman

SE
h-index 21
8 papers
21 citations
Novelty 36%
AI Score 47

8 Papers

HC May 22
From Preventive to Reactive: How AI Coding Assistants Transform Developers' Security Awareness

Faisal Haque Bappy, Tahrim Hossain, Sidratul Muntaher Meheraj et al.

AI coding assistants are now central to professional software development, yet their impact on how developers think about and practice security remains poorly understood. While prior work has documented vulnerability rates in AI-generated code, a more fundamental question persists: how do these tools transform security awareness in authentic, ongoing development practice? We conducted semi-structured interviews with 15 professional software engineers and observed them completing security-relevant coding tasks with AI assistance, spanning 3 experience cohorts defined by their relationship to AI tools during professional formation. We find that AI coding assistants reorganize rather than eliminate security thinking, shifting it from the act of writing code to the act of reviewing it. This transition from preventive to reactive security is structurally encouraged by interaction models that frame code generation as a functional task, leaving security as an afterthought. Notably, none of our coding session participants specified security requirements in their initial prompts, even when they possessed the relevant knowledge, revealing a decoupling of security awareness from security behavior. We further document informal coping strategies developers had independently invented to manage AI security risk, none of which are supported by current tools or organizations, and find that the experience cohort did not reliably predict security performance. This paper contributes a practice-grounded account of how AI-assisted development reshapes the human side of secure coding, offering empirical foundations for the design of more security-aware tools, training programs, and organizational policies.

CL Nov 10, 2025
EmoBang: Detecting Emotion From Bengali Texts

Abdullah Al Maruf, Aditi Golder, Zakaria Masud Jiyad et al.

Emotion detection from text seeks to identify an individual's emotional or mental state (positive, negative, or neutral) based on linguistic cues. While significant progress has been made for English and other high-resource languages, Bengali remains underexplored despite being the world's fourth most spoken language. The lack of large, standardized datasets makes Bengali a low-resource language for emotion detection. Existing studies mainly employ classical machine learning models with traditional feature engineering, yielding limited performance. In this paper, we introduce a new Bengali emotion dataset annotated across eight emotion categories and propose two models for automatic emotion detection: (i) a hybrid Convolutional Recurrent Neural Network (CRNN) model (EmoBangHybrid) and (ii) an AdaBoost-Bidirectional Encoder Representations from Transformers (BERT) ensemble model (EmoBangEnsemble). Additionally, we evaluate six baseline models with five feature engineering techniques and assess zero-shot and few-shot large language models (LLMs) on the dataset. To the best of our knowledge, this is the first comprehensive benchmark for Bengali emotion detection. Experimental results show that EmoBangHybrid and EmoBangEnsemble achieve accuracies of 92.86% and 93.69%, respectively, outperforming existing methods and establishing strong baselines for future research.

SE May 18
A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback

Anika Tabassum, Md Sifat Hossain, Md. Fahim Arefin et al.

Large Language Models (LLMs) demonstrate strong potential for automated code generation, yet their ability to iteratively refine solutions using execution feedback remains underexplored. Competitive programming offers an ideal testbed for this investigation, as it demands end-to-end algorithmic reasoning, precise implementation under strict computational constraints, and complete functional correctness with rigorous evaluation. In this paper, we present A-ProS, an autonomous AI agent that solves competitive programming problems through a hybrid multi-model feedback framework separating solution generation from specialized debugging. A-ProS combines ChatGPT-based generators (GPT-4 and GPT-5) with three debugging critics: Codestral-2508, Llama-3.3-70B, and DeepSeek-R1, under a 2 x 3 factorial design. We evaluate six workflows on 367 problems from ICPC World Finals (2011-2024) and Codeforces (rated 1200-1800). The results show that GPT-5 workflows improve from 39 initial accepted solutions to 85-90 after three refinement rounds, while GPT-4 improves from 15 to 31-38. A controlled ablation on 47 problems shows that stateful refinement outperforms stateless approaches by 8.5-10.6 percentage points and reduces repeated failures by up to 3.5x. Compared to baseline agent loops, A-ProS achieves over 2x greater gains, highlighting the importance of persistent context and multi-model feedback for reliable autonomous program synthesis.
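
As a rough illustration of the generate, judge, critique loop described in this abstract, here is a minimal sketch. It is not the A-ProS implementation: the names `solve`, `Attempt`, and the toy stubs are hypothetical, and the reading of "stateful" as "the generator sees every earlier attempt" versus "stateless" as "the generator sees only the latest attempt" is an assumption, since the abstract does not define the terms.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    code: str      # candidate solution that was submitted
    verdict: str   # judge result, e.g. "accepted" or "wrong answer"
    critique: str  # debugging feedback from the critic model


# Hypothetical interfaces standing in for the generator, judge, and critic.
Generator = Callable[[str, List[Attempt]], str]
Judge = Callable[[str], str]
Critic = Callable[[str, str, str], str]


def solve(problem: str, generate: Generator, judge: Judge, critique: Critic,
          max_refinements: int = 3,
          stateful: bool = True) -> Tuple[Optional[str], List[Attempt]]:
    """Generate a solution, then refine it with critic feedback.

    Assumption: stateful=True passes the full attempt history to the
    generator; stateful=False passes only the most recent attempt.
    """
    history: List[Attempt] = []
    code = generate(problem, [])
    verdict = judge(code)
    rounds = 0
    while verdict != "accepted" and rounds < max_refinements:
        history.append(Attempt(code, verdict, critique(problem, code, verdict)))
        visible = history if stateful else history[-1:]
        code = generate(problem, visible)
        verdict = judge(code)
        rounds += 1
    return (code if verdict == "accepted" else None), history


# Toy stubs (purely illustrative, no real models involved).
CANDIDATES = ["a", "b", "c"]


def toy_generate(problem: str, seen: List[Attempt]) -> str:
    # Pick the first candidate that does not appear in the visible history.
    tried = {attempt.code for attempt in seen}
    return next(c for c in CANDIDATES if c not in tried)


def toy_judge(code: str) -> str:
    return "accepted" if code == "c" else "wrong answer"


def toy_critique(problem: str, code: str, verdict: str) -> str:
    return f"{verdict} on {code}"
```

With these toy stubs, the stateful run avoids resubmitting a candidate it has already seen fail, while the stateless run can cycle between the same two failing candidates until the refinement budget runs out. This only illustrates why persistent context might matter; it does not reproduce the paper's measurements.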

SE Mar 19
DePro: Understanding the Role of LLMs in Debugging Competitive Programming Code

Nabiha Parvez, Tanvin Sarkar Pallab, Mia Mohammad Imran et al.

Debugging consumes a substantial portion of the software development lifecycle, yet the effectiveness of Large Language Models (LLMs) in this task is not well understood. Competitive programming offers a rich benchmark for such evaluation, given its diverse problem domains and strict efficiency requirements. We present an empirical study of LLM-based debugging on competitive programming problems and introduce DePro, a test-case driven approach that assists programmers by correcting existing code rather than generating new solutions. DePro combines brute-force reference generation, stress testing, and iterative LLM-guided refinement to identify and resolve errors efficiently. Experiments on 13 faulty user submissions from Codeforces demonstrate that DePro consistently produces correct solutions, reducing debugging attempts by up to 64% and debugging time by an average of 7.6 minutes per problem compared to human programmers and zero-shot LLM debugging.
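
The stress-testing step mentioned in this abstract is a common competitive programming technique: run a slow but obviously correct brute-force reference against the candidate on many small random inputs and report the first disagreement. The sketch below illustrates that general technique only. It is not DePro's code; the maximum-subarray problem, the deliberately buggy candidate, and all function names are assumptions chosen for the example.

```python
import random
from typing import Callable, List, Optional


def brute_force_max_subarray(a: List[int]) -> int:
    """O(n^2) reference: try every non-empty contiguous subarray."""
    best = a[0]
    for i in range(len(a)):
        total = 0
        for j in range(i, len(a)):
            total += a[j]
            best = max(best, total)
    return best


def faulty_max_subarray(a: List[int]) -> int:
    """Kadane's algorithm with a deliberate bug: it allows the empty
    subarray, so it returns 0 when every element is negative."""
    best = 0
    current = 0
    for x in a:
        current = max(0, current + x)
        best = max(best, current)
    return best


def stress_test(candidate: Callable[[List[int]], int],
                reference: Callable[[List[int]], int],
                trials: int = 1000,
                seed: int = 0) -> Optional[List[int]]:
    """Return the first random input where candidate and reference
    disagree, or None if no disagreement is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Small sizes and values keep counterexamples easy to read.
        n = rng.randint(1, 6)
        a = [rng.randint(-5, 5) for _ in range(n)]
        if candidate(a) != reference(a):
            return a
    return None


counterexample = stress_test(faulty_max_subarray, brute_force_max_subarray)
print("counterexample:", counterexample)
```

A failing input found this way could then be handed to an LLM together with the expected and actual outputs as part of an iterative refinement step, though how DePro structures that prompt is not stated in the abstract.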

CL Feb 4, 2025
LLM-ProS: Analyzing Large Language Models' Performance in Competitive Problem Solving

Md Sifat Hossain, Anika Tabassum, Md. Fahim Arefin et al.

The rapid advancement of large language models has opened new avenues for automating complex problem-solving tasks such as algorithmic coding and competitive programming. This paper introduces a novel evaluation technique, LLM-ProS, to assess the performance of state-of-the-art LLMs on International Collegiate Programming Contest (ICPC) problems. Using a curated dataset of 166 World Finals problems from 2011 to 2024, we benchmark the models' reasoning, accuracy, and efficiency. We evaluate five models (GPT-4o, Mistral Large, Llama-3.1-405B, and the o1 family's o1-mini and o1-preview) across critical metrics such as correctness, resource utilization, and response calibration. Our results reveal significant differences in the models' abilities to generalize, adapt, and solve novel problems. We also investigate the impact of training methodologies, dataset contamination, and chain-of-thought reasoning on model performance. The findings provide new insights into optimizing LLMs for algorithmic tasks, highlighting both strengths and limitations of current models.

LG Jan 11, 2025
EmoXpt: Analyzing Emotional Variances in Human Comments and LLM-Generated Responses

Shireesh Reddy Pyreddy, Tarannum Shaila Zaman

The widespread adoption of generative AI has generated diverse opinions, with individuals expressing both support and criticism of its applications. This study investigates the emotional dynamics surrounding generative AI by analyzing human tweets referencing terms such as ChatGPT, OpenAI, Copilot, and LLMs. To further understand the emotional intelligence of ChatGPT, we examine its responses to selected tweets, highlighting differences in sentiment between human comments and LLM-generated responses. We introduce EmoXpt, a sentiment analysis framework designed to assess both human perspectives on generative AI and the sentiment embedded in ChatGPT's responses. Unlike prior studies that focus exclusively on human sentiment, EmoXpt uniquely evaluates the emotional expression of ChatGPT. Experimental results demonstrate that LLM-generated responses are notably more efficient, cohesive, and consistently positive than human responses.

SE Dec 17, 2025
OLAF: Towards Robust LLM-Based Annotation Framework in Empirical Software Engineering

Mia Mohammad Imran, Tarannum Shaila Zaman

Large Language Models (LLMs) are increasingly used in empirical software engineering (ESE) to automate or assist annotation tasks such as labeling commits, issues, and qualitative artifacts. Yet the reliability and reproducibility of such annotations remain underexplored. Existing studies often lack standardized measures for reliability, calibration, and drift, and frequently omit essential configuration details. We argue that LLM-based annotation should be treated as a measurement process rather than a purely automated activity. In this position paper, we outline the Operationalization for LLM-based Annotation Framework (OLAF), a conceptual framework that organizes key constructs: reliability, calibration, drift, consensus, aggregation, and transparency. The paper aims to motivate methodological discussion and future empirical work toward more transparent and reproducible LLM-based annotation in software engineering research.

SE Jan 12
SECite: Analyzing and Summarizing Citations in Software Engineering Literature

Shireesh Reddy Pyreddy, Khaja Valli Pathan, Hasan Masum et al.

Identifying the strengths and limitations of a research paper is a core component of any literature review. However, traditional summaries reflect only the authors' self-presented perspective. Analyzing how other researchers discuss and cite the paper can offer a deeper, more practical understanding of its contributions and shortcomings. In this research, we introduce SECite, a novel approach for evaluating scholarly impact through sentiment analysis of citation contexts. We develop a semi-automated pipeline to extract citations referencing nine research papers and apply advanced natural language processing (NLP) techniques with unsupervised machine learning to classify these citation statements as positive or negative. Beyond sentiment classification, we use generative AI to produce sentiment-specific summaries that capture the strengths and limitations of each target paper, derived both from clustered citation groups and from the full text. Our findings reveal meaningful patterns in how the academic community perceives these works, highlighting areas of alignment and divergence between external citation feedback and the authors' own presentation. By integrating citation sentiment analysis with LLM-based summarization, this study provides a comprehensive framework for assessing scholarly contributions.