Meltem Aksoy

h-index2

6papers

21citations

Novelty28%

AI Score44

Ranked #73,209 of 201,018 authors (top 36%)#140 in CY (top 12%)

6 Papers

SEJul 25, 2025Code

Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects

Igli Begolli, Meltem Aksoy, Daniel Neider

Code review is essential for maintaining software quality but often time-consuming and cognitively demanding, especially in industrial environments. Recent advancements in language models (LMs) have opened new avenues for automating core review tasks. This study presents the empirical evaluation of monolingual fine-tuning on the performance of open-source LMs across three key automated code review tasks: Code Change Quality Estimation, Review Comment Generation, and Code Refinement. We fine-tuned three distinct models, CodeReviewer, CodeLlama-7B, and DeepSeek-R1-Distill, on a C\# specific dataset combining public benchmarks with industrial repositories. Our study investigates how different configurations of programming languages and natural languages in the training data affect LM performance, particularly in comment generation. Additionally, we benchmark the fine-tuned models against an automated software analysis tool (ASAT) and human reviewers to evaluate their practical utility in real-world settings. Our results show that monolingual fine-tuning improves model accuracy and relevance compared to multilingual baselines. While LMs can effectively support code review workflows, especially for routine or repetitive tasks, human reviewers remain superior in handling semantically complex or context-sensitive changes. Our findings highlight the importance of language alignment and task-specific adaptation in optimizing LMs for automated code review.

LGApr 29

ConformaDecompose: Explaining Uncertainty via Calibration Localization

Fatima Rabia Yapicioglu, Meltem Aksoy, Alberto Rigenti et al.

Conformal Prediction provides distribution-free prediction intervals with guaranteed coverage, but its reliance on a single global calibration threshold obscures the sources of uncertainty at the instance level. In particular, it conflates irreducible noise with uncertainty induced by heterogeneous training data (aleatoric), model limitations, or calibration mismatch (epistemic), offering little insight into why an interval is wide or whether it could be reduced. We introduce an uncertainty-aware explainability framework that analyses the reducibility of calibration-induced epistemic conformal uncertainty via progressive calibration localisation for regression tasks. The approach is diagnostic rather than causal: it does not estimate true aleatoric or epistemic uncertainty, but explains how conformal intervals contract and stabilise as calibration support is localised around a test instance. Across benchmarks and real-world data, absolute reducible uncertainty aligns with epistemic proxies, while its relative contribution varies by task, revealing regimes hidden by interval width. This instance-level view complements conformal uncertainty, enhancing interpretability without altering the predictor or coverage.

CYNov 3, 2025

A Detailed Study on LLM Biases Concerning Corporate Social Responsibility and Green Supply Chains

Greta Ontrup, Annika Bush, Markus Pauly et al.

Organizations increasingly use Large Language Models (LLMs) to improve supply chain processes and reduce environmental impacts. However, LLMs have been shown to reproduce biases regarding the prioritization of sustainable business strategies. Thus, it is important to identify underlying training data biases that LLMs pertain regarding the importance and role of sustainable business and supply chain practices. This study investigates how different LLMs respond to validated surveys about the role of ethics and responsibility for businesses, and the importance of sustainable practices and relations with suppliers and customers. Using standardized questionnaires, we systematically analyze responses generated by state-of-the-art LLMs to identify variations. We further evaluate whether differences are augmented by four organizational culture types, thereby evaluating the practical relevance of identified biases. The findings reveal significant systematic differences between models and demonstrate that organizational culture prompts substantially modify LLM responses. The study holds important implications for LLM-assisted decision-making in sustainability contexts.

CLDec 25, 2024

Whose Morality Do They Speak? Unraveling Cultural Bias in Multilingual Language Models

Meltem Aksoy

Large language models (LLMs) have become integral tools in diverse domains, yet their moral reasoning capabilities across cultural and linguistic contexts remain underexplored. This study investigates whether multilingual LLMs, such as GPT-3.5-Turbo, GPT-4o-mini, Llama 3.1, and MistralNeMo, reflect culturally specific moral values or impose dominant moral norms, particularly those rooted in English. Using the updated Moral Foundations Questionnaire (MFQ-2) in eight languages, Arabic, Farsi, English, Spanish, Japanese, Chinese, French, and Russian, the study analyzes the models' adherence to six core moral foundations: care, equality, proportionality, loyalty, authority, and purity. The results reveal significant cultural and linguistic variability, challenging the assumption of universal moral consistency in LLMs. Although some models demonstrate adaptability to diverse contexts, others exhibit biases influenced by the composition of the training data. These findings underscore the need for culturally inclusive model development to improve fairness and trust in multilingual AI systems.

CYMay 20, 2025

Choosing a Model, Shaping a Future: Comparing LLM Perspectives on Sustainability and its Relationship with AI

Annika Bush, Meltem Aksoy, Markus Pauly et al.

As organizations increasingly rely on AI systems for decision support in sustainability contexts, it becomes critical to understand the inherent biases and perspectives embedded in Large Language Models (LLMs). This study systematically investigates how five state-of-the-art LLMs -- Claude, DeepSeek, GPT, LLaMA, and Mistral - conceptualize sustainability and its relationship with AI. We administered validated, psychometric sustainability-related questionnaires - each 100 times per model -- to capture response patterns and variability. Our findings revealed significant inter-model differences: For example, GPT exhibited skepticism about the compatibility of AI and sustainability, whereas LLaMA demonstrated extreme techno-optimism with perfect scores for several Sustainable Development Goals (SDGs). Models also diverged in attributing institutional responsibility for AI and sustainability integration, a results that holds implications for technology governance approaches. Our results demonstrate that model selection could substantially influence organizational sustainability strategies, highlighting the need for awareness of model-specific biases when deploying LLMs for sustainability-related decision-making.

HCFeb 9

Campus AI vs. Commercial AI: Comparing How Students and Employees Perceive their University's LLM Chatbot vs. ChatGPT

Leon Hannig, Annika Bush, Meltem Aksoy et al.

As the use of LLM chatbots by students and researchers becomes more prevalent, universities are pressed to develop AI strategies. One strategy that many universities pursue is to customize pre-trained LLM as-a-service (LLMaaS). While most studies on LLMaaS chatbots prioritize technical adaptations, we focus on psychological effects of user-salient customizations, such as interface changes. We assume that such customizations influence users' perception of the system and are therefore important in guiding safe and appropriate use. In a field study, we examine how students and employees (N = 526) at a German university perceive and use their institution's customized LLMaaS chatbot compared to ChatGPT. Participants using both systems (n = 116) reported greater trust, higher perceived privacy and less experienced hallucinations with their university's customized LLMaaS chatbot in contrast to ChatGPT. We discuss theoretical implications for research on calibrated trust, and offer guidance on the design and deployment of LLMaaS chatbots.