Syed Ishtiaque Ahmed

CY
h-index: 28
23 papers
120 citations
Novelty: 31%
AI Score: 49

23 Papers

CL · Jun 2
Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

Atm Mizanur Rahman, Md Arid Hasan, Syed Ishtiaque Ahmed et al.

Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.

CL · May 28
When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models

Md Arid Hasan, Ruwad Naswan, Farhan Samir et al.

Large language models (LLMs) are widely used as cross-lingual knowledge interfaces. However, culturally grounded questions often reflect globally dominant narratives rather than local contexts. We study this failure mode as global narrative dominance in Bangla, a low-resource cultural context. We introduce CulturalNB, a dataset of 717 manually curated Bengali cultural instances with parallel Bangla-English question-answer pairs and supporting evidence, metadata, and sociocultural annotations. Using question-only and evidence-based prompting, we evaluate nine state-of-the-art LLMs with human and two independent LLM judges across metrics for cross-lingual consistency, language anchoring, global substitution, institutional bias, and epistemic perspective coverage. Results show that questions asked in English systematically increase global substitution and institutional framing while reducing local perspective coverage. Local evidence improves factual consistency and perspective coverage, but does not eliminate language-induced epistemic shifts. These findings suggest that cultural failures in LLMs are not only missing-knowledge errors but also failures of grounding and narrative prioritization.

HC · May 5
User Detection and Response Patterns of Sycophantic Behavior in Conversational AI

Kazi Noshin, Syed Ishtiaque Ahmed, Sharifa Sultana

Despite growing attention to LLM sycophancy from researchers and developers, users' own experiences of this behavior remain underexplored. We examine how everyday users experience AI sycophancy through Reddit discussions. Using our ODR Framework, which maps user experiences through observation, detection, and response stages, we find that users identify sycophantic behavior through methods like cross-platform comparison and consistency testing. They employ various mitigation strategies, including persona-based prompting and specific language engineering techniques. Our findings suggest that sycophancy does not have a uniformly negative effect; its impact differs by context. Users facing trauma, mental health struggles, or isolation often actively seek affirmative AI responses for emotional support. Users construct both technical and informal theories to explain sycophantic outputs. These findings suggest eliminating sycophancy entirely may be misguided. We argue for context-aware AI design that balances risks against benefits of affirmative interaction, with implications for user education and system transparency.

SE · Mar 10
"Should I Give Up Now?" Investigating LLM Pitfalls in Software Engineering

Jiessie Tie, Bingsheng Yao, Tianshi Li et al.

Software engineers are increasingly incorporating AI assistants into their workflows to enhance productivity and alleviate cognitive load. However, experiences with large language models (LLMs) such as ChatGPT vary widely. While some engineers find them useful, others deem them counterproductive due to inaccuracies in their responses. Researchers have also observed that ChatGPT often provides incorrect information. Given these limitations, it is crucial to determine how to effectively integrate LLMs into software engineering (SE) workflows. Analyzing data from 26 participants in a complex web development task, we identified nine failure types categorized into incorrect or incomplete responses, cognitive overload, and context loss. Users attempted to mitigate these issues through scaffolding, prompt clarification, and debugging. However, 17 participants ultimately chose to abandon ChatGPT due to persistent failures. Our quantitative analysis revealed that unhelpful responses increased the likelihood of abandonment by a factor of 11, while each additional prompt reduced abandonment probability by 17%. This study advances the understanding of human-AI interaction in SE tasks and outlines directions for future research and tooling support.

HC · Mar 23
Embodying Facts, Figures, and Faiths in Narrative Artistic Performances in Rural Bangladesh

Sharifa Sultana, Zinnat Sultana, Jeffrey M. Rzeszotarski et al.

There is an increasing interest in telling serious stories with data. Designers organize information, construct narratives, and present findings to inform audiences. However, many of these practices emerge from modern information visualization rhetoric and ethical frameworks which may marginalize communities with low digital and media literacy. In a ten-month-long ethnographic study in three Bangladeshi villages, we investigated how these communities use entertainment and cultural practices, namely Puthi, Bhandari Gaan, and Pot music, to instruct, communicate traditional moral lessons and recall history. We found that these communities embrace polyvocality and multiple ethical frameworks in their performances, construct narratives combining factuality, emotionality, and aesthetics, and adapt their performances to changing technology and audience needs. Our findings provide HCI, visualization, and ethical data practitioners with implications for the design of accessible and culturally appropriate ways of presenting data narratives in data-driven systems.

AI · Apr 16
Enhancing Mental Health Counseling Support in Bangladesh using Culturally-Grounded Knowledge

Md Arid Hasan, Azhagu Meena SP, Aditya Khan et al.

Large language models (LLMs) show promise in generating supportive responses for mental health and counseling applications. However, their responses often lack cultural sensitivity, contextual grounding, and clinically appropriate guidance. This work addresses the gap of how to systematically incorporate domain-specific, clinically validated knowledge into LLMs to improve counseling quality. We utilize and compare two approaches, retrieval-augmented generation (RAG) and a knowledge graph (KG)-based method, designed to support para-counselors. Our KG is constructed manually and clinically validated, capturing causal relationships between stressors, interventions, and outcomes, with contributions from people across multiple disciplines. We evaluated multiple LLMs in both settings using BERTScore F1 and SBERT cosine similarity, as well as human evaluation across five metrics, which are designed to directly measure the effectiveness of counseling beyond surface-level similarity. The results show that KG-based approaches consistently improve contextual relevance, clinical appropriateness, and practical usability compared to RAG alone, demonstrating that structured, expert-validated knowledge plays a critical role in addressing LLMs' limitations in counseling tasks.
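
The cosine similarity mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the two vectors are made-up stand-ins for sentence embeddings (a real setup would obtain them from an SBERT model), and the function name `cosine_similarity` is chosen here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical embeddings of a reference answer and a model answer.
reference_embedding = [0.2, 0.7, 0.1]
generated_embedding = [0.25, 0.6, 0.2]
similarity = cosine_similarity(reference_embedding, generated_embedding)
print(f"cosine similarity: {similarity:.3f}")
```

A score near 1 means the two embeddings point in nearly the same direction. It says nothing about clinical appropriateness, which is why the abstract pairs such scores with human evaluation.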

AI · Apr 16
Bureaucratic Silences: What the Canadian AI Register Reveals, Omits, and Obscures

Dipto Das, Christelle Tessono, Syed Ishtiaque Ahmed et al.

In November 2025, the Government of Canada operationalized its commitment to transparency by releasing its first Federal AI Register. In this paper, we argue that such registers are not neutral mirrors of government activity, but active instruments of ontological design that configure the boundaries of accountability. We analyzed the Register's complete dataset of 409 systems using the Algorithmic Decision-Making Adapted for the Public Sector (ADMAPS) framework, combining quantitative mapping with deductive qualitative coding. Our findings reveal a sharp divergence between the rhetoric of "sovereign AI" and the reality of bureaucratic practice: while 86% of systems are deployed internally for efficiency, the Register systematically obscures the human discretion, training, and uncertainty management required to operate them. By privileging technical descriptions over sociotechnical context, the Register constructs an ontology of AI as "reliable tooling" rather than "contestable decision-making." We conclude that without a shift in design, such transparency artifacts risk automating accountability into a performative compliance exercise, offering visibility without contestability.

CL · Jun 23, 2025 · Code
Human-Aligned Faithfulness in Toxicity Explanations of LLMs

Ramaravind K. Mothilal, Joanna Roy, Syed Ishtiaque Ahmed et al. · utoronto

The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity -- from their explanations that justify a stance -- to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, it is not straightforward to adopt existing methods to evaluate free-form toxicity explanation due to their over-reliance on input text perturbations, among other challenges. To account for these, we propose a novel, theoretically-grounded multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures the extent to which LLMs' free-form toxicity explanations align with those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate HAF of LLMs' toxicity explanations with no human involvement, and highlight how "non-ideal" the explanations are. We conduct several experiments on three Llama models (of size up to 70B) and an 8B Ministral model on five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations to simple prompts, their reasoning about toxicity breaks down when prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and irrelevant responses. We open-source our code at https://github.com/uofthcdslab/HAF and LLM-generated explanations at https://huggingface.co/collections/uofthcdslab/haf.

CL · Apr 30
Are You the A-hole? A Fair, Multi-Perspective Ethical Reasoning Framework

Sheza Munir, Ahanaf Rodoshi, Sumin Lee et al.

Standard methods for aggregating natural language judgments, such as majority voting, often fail to produce logically consistent results when applied to high-conflict domains, treating differing opinions as noise. We propose a neuro-symbolic aggregation framework that formalizes conflict resolution through Weighted Maximum Satisfiability (MaxSAT). Our pipeline utilizes a language model to map unstructured natural language explanations into interpretable logical predicates and confidence weights. These components are then encoded as soft constraints within the Z3 solver, transforming the aggregation problem into an optimization task that seeks the maximum consistency across conflicting testimony. Using the Reddit r/AmItheAsshole forum as a case study in large-scale moral disagreement, our system generates logically coherent verdicts that diverge from popularity-based labels 62% of the time, corroborated by an 86% agreement rate with independent human evaluators. This study demonstrates the efficacy of coupling neural semantic extraction with formal solvers to enforce logical soundness and explainability in the aggregation of noisy human reasoning.
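
To make the Weighted MaxSAT idea in the abstract above concrete, here is a minimal sketch. It is not the authors' pipeline: it swaps the Z3 solver for a brute-force search over all truth assignments, and the predicate names, clauses, and integer weights (confidences scaled by 10) are invented for illustration.

```python
from itertools import product


def weighted_max_sat(variables, soft_clauses):
    """Brute-force weighted MaxSAT over a small set of Boolean variables.

    soft_clauses is a list of (weight, clause) pairs. A clause is a list of
    (variable, polarity) literals and is satisfied when at least one literal
    matches the assignment.
    """
    best_assignment, best_weight = None, -1
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = sum(
            w
            for w, clause in soft_clauses
            if any(assignment[var] == polarity for var, polarity in clause)
        )
        if weight > best_weight:
            best_assignment, best_weight = assignment, weight
    return best_assignment, best_weight


# Hypothetical predicates extracted from conflicting comments.
variables = ["broke_promise", "had_good_reason", "at_fault"]
soft_clauses = [
    (9, [("broke_promise", True)]),    # "they broke a promise"
    (6, [("had_good_reason", True)]),  # "they had a good reason"
    # broke_promise AND NOT had_good_reason IMPLIES at_fault
    (8, [("broke_promise", False), ("had_good_reason", True), ("at_fault", True)]),
    # had_good_reason IMPLIES NOT at_fault
    (7, [("had_good_reason", False), ("at_fault", False)]),
    (5, [("at_fault", True)]),         # popular verdict: "at fault"
]

verdict, score = weighted_max_sat(variables, soft_clauses)
print(verdict, score)
```

In this toy instance the best assignment gives up the lowest-weight clause (the popular "at fault" verdict) in order to satisfy everything else, which mirrors how a consistency-maximizing verdict can diverge from a popularity-based label. Brute force is exponential in the number of variables, so it only suits tiny examples; a real solver is needed at scale.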

CL · Oct 2, 2025
LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target

Md Arid Hasan, Firoj Alam, Md Fahad Hossain et al. · utoronto

Online social media platforms are central to everyday communication and information seeking. While these platforms serve positive purposes, they also provide fertile ground for the spread of hate speech, offensive language, and bullying content targeting individuals, organizations, and communities. Such content undermines safety, participation, and equity online. Reliable detection systems are therefore needed, especially for low-resource languages where moderation tools are limited. In Bangla, prior work has contributed resources and models, but most are single-task (e.g., binary hate/offense) with limited coverage of multi-facet signals (type, severity, target). We address these gaps by introducing the first multi-task Bangla hate-speech dataset, BanglaMultiHate, one of the largest manually annotated corpora to date. Building on this resource, we conduct a comprehensive, controlled comparison spanning classical baselines, monolingual pretrained models, and LLMs under zero-shot prompting and LoRA fine-tuning. Our experiments assess LLM adaptability in a low-resource setting and reveal a consistent trend: although LoRA-tuned LLMs are competitive with BanglaBERT, culturally and linguistically grounded pretraining remains critical for robust performance. Together, our dataset and findings establish a stronger benchmark for developing culturally aligned moderation tools in low-resource contexts. For reproducibility, we will release the dataset and all related scripts.

HC · Feb 18, 2025
Talking About the Assumption in the Room

Ramaravind Kommiya Mothilal, Faisal M. Lalani, Syed Ishtiaque Ahmed et al. · utoronto

The reference to assumptions in how practitioners use or interact with machine learning (ML) systems is ubiquitous in HCI and responsible ML discourse. However, what remains unclear from prior works is the conceptualization of assumptions and how practitioners identify and handle assumptions throughout their workflows. This leads to confusion about what assumptions are and what needs to be done with them. We use the concept of an argument from Informal Logic, a branch of Philosophy, to offer a new perspective to understand and explicate the confusions surrounding assumptions. Through semi-structured interviews with 22 ML practitioners, we find what contributes most to these confusions is how independently assumptions are constructed, how reactively and reflectively they are handled, and how nebulously they are recorded. Our study brings the peripheral discussion of assumptions in ML to the center and presents recommendations for practitioners to better think about and work with assumptions.

CY · Dec 19, 2023
Ethical Artificial Intelligence Principles and Guidelines for the Governance and Utilization of Highly Advanced Large Language Models

Soaad Hossain, Syed Ishtiaque Ahmed

Given the success of ChatGPT, LaMDA and other large language models (LLMs), there has been an increase in the development and usage of LLMs within the technology sector and other sectors. While LLMs have not yet reached a level where they surpass human intelligence, there will be a time when they will. Such LLMs can be referred to as advanced LLMs. Currently, there is limited usage of ethical artificial intelligence (AI) principles and guidelines addressing advanced LLMs because we have not reached that point yet. However, this is a problem: once we do reach that point, we will not be adequately prepared to deal with the aftermath in an ethical and optimal way, which will lead to undesired and unexpected consequences. This paper addresses this issue by discussing what ethical AI principles and guidelines can be used to address highly advanced LLMs.

AI · Feb 11
Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation

Sheza Munir, Benjamin Mah, Krisha Kalsi et al.

In machine learning, "ground truth" refers to the assumed correct labels used to train and evaluate models. However, the foundational "ground truth" paradigm rests on a positivistic fallacy that treats human disagreement as technical noise rather than a vital sociotechnical signal. This systematic literature review analyzes research published between 2020 and 2025 across seven premier venues: ACL, AIES, CHI, CSCW, EAAMO, FAccT, and NeurIPS, investigating the mechanisms in data annotation practices that facilitate this "consensus trap". Our identification phase captured 30,897 records, which were refined via a tiered keyword filtration schema to a high-recall corpus of 3,042 records for manual screening, resulting in a final included corpus of 346 papers for qualitative synthesis. Our reflexive thematic analysis reveals that systemic failures in positional legibility, combined with the recent architectural shift toward human-as-verifier models, specifically the reliance on model-mediated annotations, introduce deep-seated anchoring bias and effectively remove human voices from the loop. We further demonstrate how geographic hegemony imposes Western norms as universal benchmarks, often enforced by the performative alignment of precarious data workers who prioritize requester compliance over honest subjectivity to avoid economic penalties. Critiquing the "noisy sensor" fallacy, where statistical models misdiagnose cultural pluralism as random error, we argue for reclaiming disagreement as a high-fidelity signal essential for building culturally competent models. To address these systemic tensions, we propose a roadmap for pluralistic annotation infrastructures that shift the objective from discovering a singular "right" answer to mapping the diversity of human experience.

CY · Nov 7, 2024
Evaluating the Economic Implications of Using Machine Learning in Clinical Psychiatry

Soaad Hossain, James Rasalingam, Arhum Waheed et al.

With the growing interest in using AI and machine learning (ML) in medicine, there is a growing body of literature covering the application and ethics of using AI and ML in areas of medicine such as clinical psychiatry. The problem is that there is little literature covering the economic aspects associated with using ML in clinical psychiatry. This study addresses this gap by specifically studying the economic implications of using ML in clinical psychiatry. In this paper, we evaluate the economic implications of using ML in clinical psychiatry using three problem-oriented case studies, literature on economics, socioeconomics, and medical AI, and two types of health economic evaluations. In addition, we provide details on fairness, legal, ethical, and other considerations for ML in clinical psychiatry.

CR · Nov 19, 2021
RacketStore: Measurements of ASO Deception in Google Play via Mobile and App Usage

Nestor Hernandez, Ruben Recabarren, Bogdan Carbunar et al.

Online app search optimization (ASO) platforms, which provide bulk installs and fake reviews for paying app developers in order to fraudulently boost their search rank in app stores, were shown to employ diverse and complex strategies that successfully evade state-of-the-art detection methods. In this paper we introduce RacketStore, a platform to collect data from Android devices of participating ASO providers and regular users, on their interactions with apps which they install from the Google Play Store. We present measurements from a study of 943 installs of RacketStore on 803 unique devices controlled by ASO providers and regular users, comprising 58,362,249 data snapshots collected from these devices, the 12,341 apps installed on them, and their 110,511,637 Google Play reviews. We reveal significant differences between ASO providers and regular users in terms of the number and types of user accounts registered on their devices, the number of apps they review, and the intervals between the installation times of apps and their review times. We leverage these insights to introduce features that model the usage of apps and devices, and show that they can train supervised learning algorithms to detect paid app installs and fake reviews with an F1-measure of 99.72% (AUC above 0.99), and detect devices controlled by ASO providers with an F1-measure of 95.29% (AUC = 0.95). We discuss the costs associated with evading detection by our classifiers and also the potential for app stores to use our approach to detect ASO work with privacy.

HC · Sep 7, 2021
Understanding the Social Determinants of Mental Health of the Undergraduate Students in Bangladesh: Interview Study

Ananya Bhattacharjee, S M Taiabul Haque, Abdul Hady et al.

Objective: This study aims to identify the social determinants of mental health among undergraduate students in Bangladesh, a developing nation in South Asia. Our goal is to identify the broader social determinants of mental health among this population, study the manifestation of these determinants in their day-to-day life, and explore the feasibility of self-monitoring tools in helping them identify the specific factors or relationships that impact their mental health. Methods: We conducted a 21-day study with 38 undergraduate students from seven universities in Bangladesh. We conducted two semi-structured interviews: one pre-study and one post-study. During the 21-day study, participants used an Android application to self-report and self-monitor their mood after each phone conversation. The app prompted participants to report their mood after each phone conversation and provided graphs and charts so that participants could independently review their mood and conversation patterns. Results: Our results show that academics, family, job and economic condition, romantic relationships, and religion are the major social determinants of mental health among undergraduate students in Bangladesh. Our app helped the participants pinpoint the specific issues related to these factors as participants could review the pattern of their moods and emotions from past conversation history. Although our app does not provide any explicit recommendation, participants took certain steps on their own to improve their mental health (e.g., reduced the frequency of communication with certain persons). Conclusions: Overall, the findings from this study would provide better insights for the researchers to design better solutions to help the younger population from this part of the world.

CY · Jun 16, 2021
A Fair and Ethical Healthcare Artificial Intelligence System for Monitoring Driver Behavior and Preventing Road Accidents

Soraia Oueida, Soaad Hossain, Yehia Kotb et al.

This paper presents a new approach to preventing transportation accidents and monitoring driver behavior using a healthcare AI system that incorporates fairness and ethics. Dangerous medical cases and unusual behavior of the driver are detected. A fairness algorithm is used to improve decision-making and address ethical issues such as privacy, and to account for challenges that appear in the wild within AI in healthcare and driving. A healthcare professional is alerted about any unusual activity and, when necessary, is given the driver's location so that the professional can immediately help the unstable driver. Using the healthcare AI system therefore allows accidents to be predicted and thus prevented, and lives may be saved through the AI system built into the vehicle, which interacts with the ER system.

CY · Mar 30, 2021
Towards a New Participatory Approach for Designing Artificial Intelligence and Data-Driven Technologies

Soaad Hossain, Syed Ishtiaque Ahmed

With many technical and ethical issues in artificial intelligence (AI) involving marginalized communities, there is a growing interest in design methods used with marginalized people that may be transferable to the design of AI technologies. Participatory design (PD) is a design method that is often used with marginalized communities for the design of social development, policy, IT and other matters and solutions. However, there are issues with the current PD, raising concerns when it is applied to the design of technologies, including AI technologies. This paper argues for the use of PD for the design of AI technologies, and introduces and proposes a new PD, which we call agile participatory design, that not only can be used for the design of AI and data-driven technologies, but also overcomes issues surrounding current PD and its use in the design of such technologies.

HC · Sep 26, 2020
Mental Health and Sensing

Abdul Kawsar Tushar, Muhammad Ashad Kabir, Syed Ishtiaque Ahmed

Mental health is a global epidemic, affecting close to half a billion people worldwide. A chronic shortage of resources hampers detection and recovery of affected people. Effective sensing technologies can help fight the epidemic through early detection, prediction, and resulting proper treatment. Existing and novel technologies for sensing mental health state could address the aforementioned concerns by activating granular tracking of physiological, behavioral, and social signals pertaining to problems in mental health. Our paper focuses on the available methods of sensing mental health problems through direct and indirect measures. We see how active and passive sensing by technologies as well as reporting from relevant sources can contribute toward these detection methods. We also see methods of therapeutic treatment available through digital means. We highlight a few key intervention technologies that are being developed by researchers to fight against mental illness issues.

CY · Jul 25, 2020
Combating Misinformation in Bangladesh: Roles and Responsibilities as Perceived by Journalists, Fact-checkers, and Users

Md Mahfuzul Haque, Mohammad Yousuf, Ahmed Shatil Alam et al.

There has been a growing interest within the CSCW community in understanding the characteristics of misinformation propagated through computational media, and in devising techniques to address the associated challenges. However, most work in this area has concentrated on cases in the Western world, leaving unaddressed a major portion of this problem that is situated in the Global South. This paper aims to broaden the scope of this discourse by focusing on this problem in the context of Bangladesh, a country in the Global South. The spread of misinformation on Facebook in Bangladesh, a country with a population over 163 million, has resulted in chaos, hate attacks, and killings. By interviewing journalists and fact-checkers, in addition to surveying the general public, we analyzed the current state of verifying misinformation in Bangladesh. Our findings show that most people in the 'news audience' want the news media to verify the authenticity of the information that they see online. However, the newspaper journalists say that fact-checking online information is not a part of their job, and it is also beyond their capacity given the amount of information being published online every day. We further find that the voluntary fact-checkers in Bangladesh are not equipped with sufficient infrastructural support to fill in this gap. We show how our findings are connected to some of the core concerns of the CSCW community around social media, collaboration, infrastructural politics, and information inequality. From our analysis, we also suggest several pathways to increase the impact of fact-checking efforts through collaboration, technology design, and infrastructure development.

CY · Jun 23, 2020
Ethical Analysis on the Application of Neurotechnology for Human Augmentation in Physicians and Surgeons

Soaad Hossain, Syed Ishtiaque Ahmed

With the shortage of physicians and surgeons and the increase in demand worldwide due to situations such as the COVID-19 pandemic, there is a growing interest in finding solutions to help address the problem. A solution to this problem would be to use neurotechnology to provide them with augmented cognition, senses and action for optimal diagnosis and treatment. However, doing so can negatively impact them and others. We argue that applying neurotechnology for human enhancement in physicians and surgeons can cause injustices, and harm to them and patients. In this paper, we will first describe the augmentations and neurotechnologies that can be used to achieve the relevant augmentations for physicians and surgeons. We will then review selected ethical concerns discussed within the literature, discuss the neuroengineering behind using neurotechnology for augmentation purposes, then conclude with an analysis of outcomes and ethical issues of implementing human augmentation via neurotechnology in medical and surgical practice.

SI · Nov 16, 2019
Towards Automated Sexual Violence Report Tracking

Naeemul Hassan, Amrit Poudel, Jason Hale et al.

Tracking sexual violence is a challenging task. In this paper, we present a supervised learning-based automated sexual violence report tracking model that is more scalable and reliable than its crowdsourcing-based counterparts. We define the sexual violence report tracking problem by considering the victim and perpetrator contexts and the nature of the violence. We find that our model could identify sexual violence reports with a precision and recall of 80.4% and 83.4%, respectively. Moreover, we also applied the model during and after the #MeToo movement. We discover several interesting findings that are not easily identifiable from a shallow analysis.
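
The abstract above reports precision and recall but no combined score. As a worked example, the sketch below computes the F1 measure (the harmonic mean of the two) from the reported 80.4% and 83.4%. The resulting figure is derived here and is not a number reported in the abstract; the function name `f1_score` is chosen for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as stated in the abstract, converted to fractions.
f1 = f1_score(0.804, 0.834)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.819"
```

Because the harmonic mean is pulled toward the smaller of its inputs, the result sits between the two reported values, slightly below their arithmetic mean.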

SE · Oct 3, 2019
Critical Requirements Engineering in Practice

Leticia Duboc, Curtis McCord, Christoph Becker et al.

The design of software systems inevitably enacts normative boundaries around the site of intervention. These boundaries are, in part, a reflection of the values, ethics, power, and politics of the situation and the process of design itself. This paper argues that Requirements Engineering (RE) requires more robust frameworks and techniques to navigate the values implicit in systems design work. To this end, we present the findings from a case of action research where we employed Critical Systems Heuristics (CSH), a framework from Critical Systems Thinking (CST), during requirements gathering for Homesound, a system to safeguard elderly people living alone while protecting their autonomy. We use categories from CSH to inform expert interviews and reflection, showing how CSH can be simply combined with RE techniques (such as the Volere template) to explore and reveal the value-judgements underlying requirements.