Gouri Ginde

SE
h-index: 5
13 papers
40 citations
Novelty: 30%
AI Score: 48

13 Papers

IV · Aug 20, 2024
ISLES'24: Final Infarct Prediction with Multimodal Imaging and Clinical Data. Where Do We Stand?

Ezequiel de la Rosa, Ruisheng Su, Mauricio Reyes et al.

Accurate estimation of brain infarction (i.e., irreversibly damaged tissue) is critical for guiding treatment decisions in acute ischemic stroke. Reliable infarct prediction informs key clinical interventions, including the need for patient transfer to comprehensive stroke centers, the potential benefit of additional reperfusion attempts during mechanical thrombectomy, decisions regarding secondary neuroprotective treatments, and ultimately, prognosis of clinical outcomes. This work introduces the Ischemic Stroke Lesion Segmentation (ISLES) 2024 challenge, which focuses on the prediction of final infarct volumes from pre-interventional acute stroke imaging and clinical data. ISLES24 provides a comprehensive, multimodal setting where participants can leverage all clinically and practically available data, including full acute CT imaging, sub-acute follow-up MRI, and structured clinical information, across a train set of 150 cases. On the hidden test set of 98 cases, the top-performing model, a multimodal nnU-Net-based architecture, achieved a Dice score of 0.285 (+/- 0.213) and an absolute volume difference of 21.2 (+/- 37.2) mL, underlining the significant challenges posed by this task and the need for further advances in multimodal learning. This work makes two primary contributions: first, we establish a standardized, clinically realistic benchmark for post-treatment infarct prediction, enabling systematic evaluation of multimodal algorithmic strategies on a longitudinal stroke dataset; second, we analyze current methodological limitations and outline key research directions to guide the development of next-generation infarct prediction models.
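The two metrics quoted in this abstract, Dice score and absolute volume difference, can be illustrated with a minimal sketch. It assumes flattened binary masks and a known per-voxel volume; the function names, the `voxel_volume_ml` parameter, and the empty-mask convention are illustrative assumptions, not the challenge's official evaluation code.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def absolute_volume_difference_ml(
    pred: Sequence[int], truth: Sequence[int], voxel_volume_ml: float
) -> float:
    """|predicted lesion volume - reference lesion volume| in millilitres."""
    pred_volume = sum(1 for p in pred if p) * voxel_volume_ml
    truth_volume = sum(1 for t in truth if t) * voxel_volume_ml
    return abs(pred_volume - truth_volume)


# Toy example: 4 voxels, each assumed to be 0.5 mL.
predicted = [1, 1, 1, 0]
reference = [1, 0, 0, 0]
print(dice_score(predicted, reference))
print(absolute_volume_difference_ml(predicted, reference, voxel_volume_ml=0.5))
```

Dice measures spatial overlap while the volume difference ignores location entirely, so a model can score reasonably on one and poorly on the other.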

AI · Aug 28, 2024
Trustworthy and Responsible AI for Human-Centric Autonomous Decision-Making Systems

Farzaneh Dehghani, Mahsa Dibaji, Fahim Anzum et al.

Artificial Intelligence (AI) has paved the way for revolutionary decision-making processes which, if harnessed appropriately, can contribute to advancements in various sectors, from healthcare to economics. However, its black-box nature presents significant ethical challenges related to bias and transparency. Biases strongly affect AI applications: they produce inconsistent and unreliable findings, incur significant costs and consequences, and highlight and perpetuate inequalities and unequal access to resources. Hence, developing safe, reliable, ethical, and Trustworthy AI systems is essential. Our team, part of the Transdisciplinary Scholarship Initiative within the University of Calgary, conducts research on Trustworthy and Responsible AI, including fairness, bias mitigation, reproducibility, generalization, interpretability, and authenticity. In this paper, we review and discuss the intricacies of AI biases, definitions, methods of detection and mitigation, and metrics for evaluating bias. We also discuss open challenges with regard to the trustworthiness and widespread application of AI across diverse domains of human-centric decision making, as well as guidelines to foster Responsible and Trustworthy AI models.

SE · May 22
Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries

Hinduja Nirujan, Shreyas Patil, Abdallah Ayoub et al.

Large Language Models (LLMs) are increasingly used to generate summaries of software bug reports, including sections such as Steps-to-Reproduce (S2R), Actual Behavior (AB), and Expected Behavior (EB). However, these models frequently produce hallucinations that can be convincing but unsupported by the source report. This can mislead developers and reduce trust in automated maintenance tools. Existing hallucination detection approaches typically evaluate outputs at the full-response level and do not consider the structure of technical documents. An initial exploratory study on 80 structured bug report summaries found that approximately 47.9% contained missing information, while 12.3% included fabricated content, highlighting the need for systematic hallucination analysis in bug report summarization. In this work, we empirically investigate hallucinations in LLM-generated bug report summaries from a section-aware perspective. Using the BugsRepo dataset, derived from Mozilla OSS projects, we introduce controlled synthetic hallucination injection to construct a benchmark for training and evaluation. We propose a section-aware hallucination detection approach that jointly predicts whether a summary contains hallucinated content, identifies affected sections, and classifies hallucination types. Experimental results across multiple pretrained language models show that the proposed approach achieves strong performance across all tasks, with the best model obtaining 0.89 report-level Macro-F1, 0.83 section-level Macro-F1, and 0.84 hallucination-type Macro-F1. We further analyze common hallucination patterns and model failure modes to better understand limitations of current LLM-generated bug report summaries. The findings highlight the importance of section-aware hallucination analysis for improving the reliability of LLM-assisted bug report summarization in software maintenance workflows.
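The Macro-F1 figures reported above average the per-class F1 scores with equal weight per class. A minimal sketch of that metric follows; the label names are hypothetical, and scoring a class with no true or predicted instances as 0 is an assumed convention rather than anything stated in the abstract.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    labels = set(y_true) | set(y_pred)
    if not labels:
        raise ValueError("at least one label is required")
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when the class never appears.
        per_class.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical report-level labels: "h" = hallucinated, "c" = clean.
truth = ["h", "h", "c", "c"]
predicted = ["h", "c", "c", "c"]
print(macro_f1(truth, predicted))
```

Because every class contributes equally, Macro-F1 penalizes a detector that performs well only on the majority class.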

SE · Apr 18
Exploring Ethical Concerns of Mobile Applications from App Reviews: A Literature Survey

Aakash Sorathiya, Gouri Ginde

Ethical concerns in mobile applications (apps), such as privacy, security, and accessibility, are commonly subsumed under non-functional requirements and are generally reported by users through app reviews in app stores. However, these concerns often remain unidentified among other types of reviews, such as user experiences, problem reports, and new feature discussions. Over the past decade, extensive research has focused on extracting valuable information from app reviews, including feature requests and bug reports. However, a synthesis of research on app review analysis for exploring users' ethical concerns is still lacking. This paper presents a comprehensive survey of this research area, covering 37 relevant studies published since 2012, identified from an initial 553 studies using specific inclusion and exclusion criteria. The studies examined vary in review counts, ranging from 500 to 626 million, and cover between one and 1.3 million apps. Our detailed analysis highlights diverse objectives, methodologies, and strategies, along with additional resources such as app privacy policies, which researchers generally utilize to analyze ethical concerns. Our findings also identify persistent barriers to privacy, security, accessibility, transparency, fairness, accountability, and safety, as reported by users in app reviews. Furthermore, we propose a research agenda that focuses on four key areas, including automated extraction and classification of ethical concerns-related app reviews. Our survey outcomes can assist developers and system architects in recognizing and prioritizing non-functional requirements at the initial stages of the development lifecycle, whereas researchers can expand upon this synthesis to create tools for the automated detection of ethical concerns.

SE · Apr 26, 2025 · Code
Can We Enhance Bug Report Quality Using LLMs?: An Empirical Study of LLM-Based Bug Report Generation

Jagrit Acharya, Gouri Ginde

Bug reports contain the information developers need to triage and fix software bugs. However, unclear, incomplete, or ambiguous information may lead to delays and excessive manual effort spent on bug triage and resolution. In this paper, we explore whether instruction fine-tuned Large Language Models (LLMs) can automatically transform casual, unstructured bug reports into high-quality, structured bug reports adhering to a standard template. We evaluate three open-source instruction-tuned LLMs (Qwen 2.5, Mistral, and Llama 3.2) against ChatGPT-4o, measuring performance on established metrics such as CTQRS, ROUGE, METEOR, and SBERT. Our experiments show that fine-tuned Qwen 2.5 achieves a CTQRS score of 77%, outperforming fine-tuned Mistral (71%), Llama 3.2 (63%), and ChatGPT in 3-shot learning (75%). Further analysis reveals that Llama 3.2 shows higher accuracy in detecting missing fields, particularly Expected Behavior and Actual Behavior, while Qwen 2.5 demonstrates superior performance in capturing Steps-to-Reproduce, with an F1 score of 76%. Additional testing of the models on other popular projects (e.g., Eclipse, GCC) demonstrates that our approach generalizes well, achieving up to 70% CTQRS on bug reports from unseen projects. These findings highlight the potential of instruction fine-tuning in automating structured bug report generation, reducing manual effort for developers and streamlining the software maintenance process.

SE · Mar 25
Towards Energy-aware Requirements Dependency Classification: Knowledge-Graph vs. Vector-Retrieval Augmented Inference with SLMs

Shreyas Patil, Pragati Kumari, Novarun Deb et al.

The continuous evolution of system specifications necessitates frequent evaluation of conflicting requirements, a process that is traditionally labour-intensive. Although large language models (LLMs) have demonstrated significant potential for automating this detection, their massive computational requirements often result in excessive energy waste. Consequently, there is a growing need to transition toward Small Language Models (SLMs) and energy-aware architectures for sustainable Requirements Engineering. This study proposes and empirically evaluates an energy-aware framework that compares Knowledge Graph-based Retrieval (KGR) with Vector-based Semantic Retrieval (VSR) to enhance SLM-based inference at the 7B to 8B parameter scale. By leveraging structured graph traversal and high-dimensional semantic mapping, we extract candidate requirements, which are then classified as conflicting or neutral by an inference engine. We evaluate these retrieval-enhanced strategies across Zero-Shot, Few-Shot, and Chain-of-Thought prompting methods. Using a three-pillar sustainability framework measuring energy consumption (Wh), latency (s), and carbon emissions (gCO2eq) alongside standard accuracy metrics (F1 Score), this research provides a first systematic empirical evaluation and trade-off analysis between predictive performance and environmental impact. Our findings highlight the effectiveness of structured versus semantic retrieval in detecting requirement conflicts, offering a reproducible, sustainability-aware architecture for energy-efficient requirements engineering.

CV · Jul 28, 2025 · Code
Enhancing and Accelerating Brain MRI through Deep Learning Reconstruction Using Prior Subject-Specific Imaging

Amirmohammad Shamaei, Alexander Stebner, Salome et al.

Magnetic resonance imaging (MRI) is a crucial medical imaging modality. However, long acquisition times remain a significant challenge, leading to increased costs and reduced patient comfort. Recent studies have shown the potential of using deep learning models that incorporate information from prior subject-specific MRI scans to improve reconstruction quality of present scans. Integrating this prior information requires registration of the previous scan to the current image reconstruction, which can be time-consuming. We propose a novel deep-learning-based MRI reconstruction framework that consists of an initial reconstruction network, a deep registration model, and a transformer-based enhancement network. We validated our method on a longitudinal dataset of T1-weighted MRI scans with 2,808 images from 18 subjects at four acceleration factors (R5, R10, R15, R20). Quantitative metrics confirmed our approach's superiority over existing methods (p < 0.05, Wilcoxon signed-rank test). Furthermore, we analyzed the impact of our MRI reconstruction method on the downstream task of brain segmentation and observed improved accuracy and volumetric agreement with reference segmentations. Our approach also achieved a substantial reduction in total reconstruction time compared to methods that use traditional registration algorithms, making it more suitable for real-time clinical applications. The code associated with this work is publicly available at https://github.com/amirshamaei/longitudinal-mri-deep-recon.

SE · May 8
What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook

Junyu Huo, Ziqi Mao, Zihao Wan et al.

AI agents are increasingly framed as software-engineering teammates, yet most research studies them inside human-centered workflows. Little is known about the software-engineering discourse autonomous AI agents produce when they interact primarily with one another. This paper examines what autonomous AI agents discuss in MoltBook, an AI-agents-only social network, how that discourse is organized, and how it differs from human developer discourse. We combine human open coding of a 500-post sample, a concentration-plus-check topic-analysis pipeline over 4,707 English-filtered MoltBook technology posts, and a matched-instrument comparison against 5,211 GitHub Discussions posts. MoltBook technology discourse spans 12 recurring themes and is led by Security and Trust (27.4%). At the community level, activity is highly concentrated: the largest submolt contains 63.5% of posts and the Gini coefficient is 0.88, yet a stability-aware BERTopic pipeline still yields 32 non-outlier sub-topics. Compared with the GitHub Discussions baseline, MoltBook discourse contains fewer concrete, context-rich cues such as code-formatted artifacts, environment details, runtime failures, and reproduction steps; social mimicry appears only in a limited way, while idealization is mainly reflected through lower hedging. Overall, AI-only technical discourse is coherent but selective. It repeatedly returns to concerns such as security and trust, memory and context management, tooling and APIs, debugging and error handling, workflow automation, and infrastructure/ops, while omitting much of the concrete runtime and project-local detail common in human developer discourse. This may be because MoltBook contains fewer environment-specific failures, reproduction steps, and other concrete grounding cues.
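The concentration figure above (a Gini coefficient of 0.88 across submolts) can be made concrete with a small sketch. It uses the standard rank-based formula on per-community post counts; the abstract does not state which estimator the authors used, so this is an assumption, and the counts below are made up for illustration.

```python
from typing import Sequence


def gini(counts: Sequence[float]) -> float:
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one community with a positive count")
    # Rank-weighted sum over ascending values, with ranks starting at 1.
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical post counts per community, not data from the study.
print(gini([25, 25, 25, 25]))  # evenly spread activity
print(gini([0, 0, 0, 100]))    # all activity in one community
```

With this finite-sample formula the maximum for n communities is (n - 1) / n, so the value approaches 1 only when many communities exist and almost all posts sit in one of them.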

LG · Apr 21
Do Masked Autoencoders Improve Downhole Prediction? An Empirical Study on Real Well Drilling Data

Aleksander Berezowski, Hassan Hassanzadeh, Gouri Ginde

Downhole drilling telemetry presents a fundamental labeling asymmetry: surface sensor data are generated continuously at 1 Hz, while labeled downhole measurements are costly, intermittent, and scarce. Current machine learning approaches for downhole metric prediction universally adopt fully supervised training from scratch, which is poorly suited to this data regime. We present the first empirical evaluation of masked autoencoder (MAE) pretraining for downhole drilling metric prediction. Using two publicly available Utah FORGE geothermal wells comprising approximately 3.5 million timesteps of multivariate drilling telemetry, we conduct a systematic full-factorial design space search across 72 MAE configurations and compare them against supervised LSTM and GRU baselines on the task of predicting Total Mud Volume. Results show that the best MAE configuration reduces test mean absolute error by 19.8% relative to the supervised GRU baseline, while trailing the supervised LSTM baseline by 6.4%. Analysis of design dimensions reveals that latent space width is the dominant architectural choice (Pearson r = -0.59 with test MAE), while masking ratio has negligible effect, an unexpected finding attributed to high temporal redundancy in 1 Hz drilling data. These results establish MAE pretraining as a viable paradigm for drilling analytics and identify the conditions under which it is most beneficial.
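The two kinds of numbers in this abstract, a relative change in test error against a baseline and a Pearson correlation between a design dimension and test error, can be sketched as follows. All values are hypothetical and chosen only to show the arithmetic; they are not results from the paper, and the function names are illustrative.

```python
import math
from typing import Sequence


def relative_change(value: float, baseline: float) -> float:
    """(value - baseline) / baseline; negative means lower error than baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical errors: a model at 80.2 against a baseline at 100.0 is a
# 19.8% reduction in mean absolute error.
print(relative_change(80.2, 100.0))

# Hypothetical sweep: wider latent spaces paired with lower test error
# give a negative correlation.
latent_widths = [16, 32, 64, 128]
test_errors = [9.0, 8.1, 7.5, 7.4]
print(pearson_r(latent_widths, test_errors))
```

A negative r, as reported in the abstract, means that larger values of the design dimension tend to coincide with lower test error across the configurations searched.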

LG · Apr 16
Assessing the Potential of Masked Autoencoder Foundation Models in Predicting Downhole Metrics from Surface Drilling Data

Aleksander Berezowski, Hassan Hassanzadeh, Gouri Ginde

Oil and gas drilling operations generate extensive time-series data from surface sensors, yet accurate real-time prediction of critical downhole metrics remains challenging due to the scarcity of labelled downhole measurements. This systematic mapping study reviews thirteen papers published between 2015 and 2025 to assess the potential of Masked Autoencoder Foundation Models (MAEFMs) for predicting downhole metrics from surface drilling data. The review identifies eight commonly collected surface metrics and seven target downhole metrics. Current approaches predominantly employ neural network architectures such as artificial neural networks (ANNs) and long short-term memory (LSTM) networks, yet no studies have explored MAEFMs despite their demonstrated effectiveness in time-series modeling. MAEFMs offer distinct advantages through self-supervised pre-training on abundant unlabeled data, enabling multi-task prediction and improved generalization across wells. This research establishes that MAEFMs represent a technically feasible but unexplored opportunity for drilling analytics, recommending future empirical validation of their performance against existing models and exploration of their broader applicability in oil and gas operations.

CL · Nov 11, 2024
Beyond Keywords: A Context-based Hybrid Approach to Mining Ethical Concern-related App Reviews

Aakash Sorathiya, Gouri Ginde

With the increasing proliferation of mobile applications in our everyday experiences, the concerns surrounding ethics have surged significantly. Users generally communicate their feedback, report issues, and suggest new functionalities in application (app) reviews, frequently emphasizing safety, privacy, and accountability concerns. Incorporating these reviews is essential to developing successful products. However, app reviews related to ethical concerns generally use domain-specific language and are expressed using a more varied vocabulary, which makes automated extraction of ethical concern-related app reviews challenging and time-consuming. This study proposes a novel Natural Language Processing (NLP) based approach that combines Natural Language Inference (NLI), which provides a deep comprehension of language nuances, and a decoder-only (LLaMA-like) Large Language Model (LLM) to extract ethical concern-related app reviews at scale. Utilizing 43,647 app reviews from the mental health domain, the proposed methodology 1) evaluates four NLI models to extract potential privacy reviews and compares the results of domain-specific privacy hypotheses with generic privacy hypotheses; 2) evaluates four LLMs for classifying app reviews as privacy concerns; and 3) uses the best NLI and LLM models to extract new privacy reviews from the dataset. Results show that the DeBERTa-v3-base-mnli-fever-anli NLI model with domain-specific hypotheses yields the best performance, and the Llama3.1-8B-Instruct LLM performs best in the classification of app reviews. Then, using NLI+LLM, an additional 1,008 new privacy-related reviews were extracted that were not identified through the keyword-based approach in previous research, thus demonstrating the effectiveness of the proposed approach.

CL · Jul 29, 2025
Automatic Classification of User Requirements from Online Feedback -- A Replication Study

Meet Bhatt, Nic Boilard, Muhammad Rehan Chaudhary et al.

Natural language processing (NLP) techniques have been widely applied in the requirements engineering (RE) field to support tasks such as classification and ambiguity detection. Although RE research is rooted in empirical investigation, it has paid limited attention to replicating NLP for RE (NLP4RE) studies. The rapidly advancing realm of NLP is creating new opportunities for efficient, machine-assisted workflows, which can bring new perspectives and results to the forefront. Thus, we replicate and extend a previous NLP4RE study (baseline), "Classifying User Requirements from Online Feedback in Small Dataset Environments using Deep Learning", which evaluated different deep learning models for requirement classification from user reviews. We reproduced the original results using publicly released source code, thereby helping to strengthen the external validity of the baseline study. We then extended the setup by evaluating model performance on an external dataset and comparing results to a GPT-4o zero-shot classifier. Furthermore, we prepared the replication study ID-card for the baseline study, which is important for evaluating replication readiness. Results showed diverse reproducibility levels across different models, with Naive Bayes demonstrating perfect reproducibility. In contrast, BERT and other models showed mixed results. Our findings revealed that the baseline deep learning models, BERT and ELMo, exhibited good generalization capabilities on an external dataset, and GPT-4o showed performance comparable to traditional baseline machine learning models. Additionally, our assessment confirmed the baseline study's replication readiness; however, the environment setup files were missing, and including them would have further enhanced readiness. We include this missing information in our replication package and provide the replication study ID-card for our study to further encourage and support the replication of our study.

HC · Nov 3, 2016
Visualisation of massive data from scholarly Article and Journal Database A Novel Scheme

Gouri Ginde

Publishing scholarly articles and getting cited has become a way of life for academicians. These scholarly publications shape the career growth not only of the authors but also of the country, the continent, and the technological domains. Author affiliations, country, and other information about an author, coupled with data analytics, can provide useful and insightful results. However, massive and complete data is required to perform this research. Google Scholar, a comprehensive and free repository of scholarly articles, has been used as the data source for this purpose. Data scraped from Google Scholar, when stored as a graph and visualized in the form of nodes and relationships, can reveal otherwise concealed information, such as an evident domain shift of an author, the spread of an author's research domains, prediction of emerging domains and sub-domains, and detection of journal-level and author-level citation cartel behaviors. The data from the graph database is also used to compute scholastic indicators for the journals. Finally, an econometric model, the Cobb-Douglas model, is used to model and compute the journals' "Internationality" Index based on these scholastic indicators.
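For reference, the general Cobb-Douglas form that such an index would build on can be written as below. The abstract does not say which scholastic indicators serve as inputs or how the exponents are chosen, so the symbols $x_i$, $\alpha_i$, and $A$ are generic placeholders rather than the paper's notation.

```latex
% General Cobb-Douglas form with n inputs (generic notation, assumed here):
%   Y        : output, here standing for the index value of a journal
%   x_i      : the i-th input, e.g. one scholastic indicator
%   \alpha_i : elasticity (exponent) of the i-th input
%   A        : positive scaling constant
\[
  Y = A \prod_{i=1}^{n} x_i^{\alpha_i}
\]
% Taking logarithms gives a form that is linear in the parameters:
\[
  \ln Y = \ln A + \sum_{i=1}^{n} \alpha_i \ln x_i
\]
```

The log-linear form is what makes the exponents straightforward to estimate or interpret: each $\alpha_i$ is the percentage change in $Y$ associated with a one percent change in $x_i$, holding the other inputs fixed.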