Serkan Ayvaz

AI
h-index: 14
17 papers
103 citations
Novelty: 33%
AI Score: 49

17 Papers

AI · Jul 5, 2023
Beyond Known Reality: Exploiting Counterfactual Explanations for Medical Research

Toygar Tanyel, Serkan Ayvaz, Bilgin Keserci

The field of explainability in artificial intelligence (AI) has witnessed a growing number of studies and increasing scholarly interest. However, the lack of human-friendly and individual interpretations in explaining the outcomes of machine learning algorithms has significantly hindered the acceptance of these methods by clinicians in their research and clinical practice. To address this issue, our study uses counterfactual explanations to explore the applicability of "what if?" scenarios in medical research. Our aim is to expand our understanding of magnetic resonance imaging (MRI) features used for diagnosing pediatric posterior fossa brain tumors beyond existing boundaries. In our case study, the proposed concept provides a novel way to examine alternative decision-making scenarios that offer personalized and context-specific insights, enabling the validation of predictions and clarification of variations under diverse circumstances. Additionally, we explore the potential use of counterfactuals for data augmentation and evaluate their feasibility as an alternative approach in our medical research case. The results demonstrate the promising potential of using counterfactual explanations to improve AI-driven methods in clinical research.

12.9 · IR · May 18
PIPER: Content-Based Table Search via profiling and LLM-Generated Pseudoqueries

Riccardo Terrenzi, Matteo Falconi, Serkan Ayvaz et al.

The rapid growth of tabular datasets in data lakes, data spaces, and open data portals makes effective dataset search essential for reuse and analysis. Existing search systems rely mainly on metadata, which is often incomplete or low quality, especially for tables whose meaning depends on both schema and cell values. Recent advances in Large Language Models (LLMs) enable richer, content-based representations of tables. However, prior LLM-based retrieval methods have focused on Table Question Answering, where the goal is to select a single table to answer a question, rather than retrieve and rank relevant datasets. We propose PIPER, a content-driven retrieval method for tabular datasets that uses table profiles and LLM-generated queries embedded for dense retrieval. Designed for dataset search in poor-metadata settings, PIPER outperforms both classical metadata-based baselines and strong TableQA retrieval methods, demonstrating the value of LLM-based content modeling for tabular dataset search.

CY · Feb 26
GenAI Integration into Engineering Education: A Case Study of an Introductory Undergraduate Engineering Course

Kadir Kozan, Ozgur Keles, Sihan Jian et al.

GenAI has the potential to enhance the learning and teaching processes in engineering education. For instance, GenAI feedback on students' task performance can be effective depending on when such feedback is provided. However, little is known about how engineering faculty and instructors discover such potential within the scope of their instruction when they try out the technology for the first time. To this end, this study aimed to describe an engineering instructor's and seven teaching assistants' initial experiences of integrating GenAI into their undergraduate engineering course and the corresponding changes in students' formative exercise performance. An embedded descriptive single case study design was employed. The corresponding research data included four interviews conducted at the beginning, middle and end of an academic semester, and students' formative exercise performance. Overall, after GenAI integration, students' formative exercise performance increased, and a critical and reflective practice of learning about how to integrate GenAI into instruction provided informative insights. Still, technology integration stayed at the level of replacing other instructional methods or increasing the efficiency of solving coding problems. It turned out to be exciting and surprising for students to be able to use GenAI in course work even though their use of the technology weakened over time. Our findings suggest that engineering teaching staff's initial experimental experiences with GenAI integration can be informative and provide context-specific practical insights. Therefore, it is reasonable for higher education institutions to encourage such experiences especially when much remains unknown about an emerging technology.

12.7 · AI · May 14
Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG

Riccardo Terrenzi, Maximilian von Zastrow, Serkan Ayvaz

Retrieval-Augmented Generation can improve factuality by grounding answers in external evidence, but Agentic GraphRAG complicates what it means for citations to be faithful. In these systems, an agent explores a knowledge graph before producing an answer and a small set of citations. We frame citation faithfulness as a trajectory-level problem: final citations should not only support the answer, but also account for the graph traversal, structure, and visited-but-uncited entities that may influence it. Through controlled ablation experiments, we compare the effects of isolating, removing, and masking cited and uncited graph entities. Our results show that cited evidence is often necessary, as removing it substantially changes answers and reduces accuracy. However, citations are not sufficient, because accurate answers can also depend on uncited traversal context and surrounding graph structure. These findings suggest that citation evaluation in Agentic GraphRAG should move beyond source support toward provenance over the broader retrieval trajectory.

47.4 · CR · May 11
Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

Phongsakon Mark Konrad, Toygar Tanyel, Serkan Ayvaz

Safe fine-tuning defenses are often endorsed on the basis of a held-out gap reduction, but the same reduction can come from sampling noise, subject artifacts, capability loss, or a mechanism that does not transfer. We introduce Acceptance Cards: an evaluation protocol, a documentation object, an executable audit package, and a claim-specific evidential standard for safe fine-tuning defense claims. The protocol checks statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer before treating a gap reduction as a full-card pass. Re-scored under this installed-gap protocol, SafeLoRA fails the full-card pass on Gemma-2-2B-it: under strict mechanism-class coding it fails all four diagnostics, and under a permissive shrinkage relabel it still fails three of four. This is a narrow installed-gap audit on one model family, not a global judgment of SafeLoRA's effectiveness. In a 46-cell audit, no cell satisfies the strict conjunction. The closest family is a near miss that passes reliability and mechanism checks where the required data are available, but fails the fresh-subject threshold, lacks a strict transfer pass, and carries a measurable deployment-accuracy cost.

24.2 · AI · May 11
The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime

Phongsakon Mark Konrad, Tim Lukas Adam, Ane Cathrine Holst Merrild et al.

AI deployment in sensitive domains such as health care, credit, employment, and criminal justice is often treated as unsafe to authorize until model internals can be explained. This often leads to an excessive reliance on mechanistic interpretability to address a deployment challenge beyond its intended scope. We argue that the gate should instead be calibrated verification: authorization should be domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable. The reason is twofold. First, model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general. Second, societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Recent evidence reinforces this distinction between mechanistic understanding and deployment authority: a 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action, while one scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study. We propose Verification Coverage, a six-component reportable standard with a minimum-composition rule, as the metric that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.

37.3 · SE · Apr 5
Architecture Without Architects: How AI Coding Agents Shape Software Architecture

Phongsakon Mark Konrad, Tim Lukas Adam, Riccardo Terrenzi et al.

AI coding agents select frameworks, scaffold infrastructure, and wire integrations, often in seconds. These are architectural decisions, yet almost no one reviews them as such. We identify five mechanisms by which agents make implicit architectural choices and propose six prompt-architecture coupling patterns that map natural-language prompt features to the infrastructure they require. The patterns range from contingent couplings (structured output validation) that may weaken as models improve to fundamental ones (tool-call orchestration) that persist regardless of model capability. An illustrative demonstration confirms that prompt wording alone produces structurally different systems for the same task. We term the phenomenon vibe architecting, architecture shaped by prompts rather than deliberate design, and outline review practices, decision records, and tooling to bring these hidden decisions under governance.

51.4 · IR · Mar 28
A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search

Riccardo Terrenzi, Phongsakon Mark Konrad, Tim Lukas Adam et al.

Ad hoc dataset search requires matching underspecified natural-language queries against sparse, heterogeneous metadata records, a task where typical lexical or dense retrieval alone falls short. We reposition dataset search as a software-architecture problem and propose a bounded, auditable reference architecture for agentic hybrid retrieval that combines BM25 lexical search with dense-embedding retrieval via reciprocal rank fusion (RRF), orchestrated by a large language model (LLM) agent that repeatedly plans queries, evaluates the sufficiency of results, and reranks candidates. To reduce the vocabulary mismatch between user intent and provider-authored metadata, we introduce an offline metadata augmentation step in which an LLM generates pseudo-queries for each dataset record, augmenting both retrieval indexes before query time. Two architectural styles are examined: a single ReAct agent and a multi-agent horizontal architecture with Feedback Control. Their quality-attribute tradeoffs are analyzed with respect to modifiability, observability, performance, and governance. An evaluation framework comprising seven system variants is defined to isolate the contribution of each architectural decision. The architecture is presented as an extensible reference design for the software architecture community, incorporating explicit governance tactics to bound and audit nondeterministic LLM components.
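
The fusion step named in the abstract above, reciprocal rank fusion (RRF), can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function name, the dataset identifiers, and the smoothing constant `k = 60` (a commonly used default for RRF) are assumptions that do not come from the abstract.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list.

    Each id receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); ids are returned by descending fused score.
    k = 60 is an assumed default, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break score ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a lexical (BM25) and a dense retriever.
bm25_ranking = ["ds_a", "ds_b", "ds_c"]
dense_ranking = ["ds_b", "ds_d", "ds_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because RRF uses only rank positions, it needs no calibration between BM25 scores and embedding similarities, which is one plausible reason for choosing it in a hybrid setup.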

29.3 · SE · Apr 7
CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models

Tim Lukas Adam, Phongsakon Mark Konrad, Riccardo Terrenzi et al.

In today's software architecture, large language models (LLMs) serve as software architecture co-pilots. However, no benchmark currently exists to evaluate large language models' actual understanding of cloud-native software architecture. For this reason, we present a benchmark called CAKE, which consists of 188 expert-validated questions covering four cognitive levels of Bloom's revised taxonomy -- recall, analyze, design, and implement -- and five cloud-native topics. Evaluation is conducted on 22 model configurations (0.5B--70B parameters) across four LLM families, using three-run majority voting for multiple-choice questions (MCQs) and LLM-as-a-judge scoring for free responses (FR). Based on this evaluation, four notable findings were identified. First, MCQ accuracy plateaus above 3B parameters, with the best model reaching 99.2%. Second, free-response scores scale steadily across all cognitive levels. Third, the two formats capture different facets of knowledge, as the MCQ accuracy approaches a ceiling while free responses continue to differentiate models. Finally, reasoning augmentation (+think) improves free-response quality, while tool augmentation (+tool) degrades performance for small models. These results suggest that the evaluation format fundamentally shapes how we measure architectural knowledge in LLMs.
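
The three-run majority voting mentioned for the multiple-choice questions could look roughly like the sketch below. The abstract does not say how a three-way disagreement is handled; returning `None` in that case (so the item can be scored as incorrect) is an assumption, as are the function name and the sample answers.

```python
from collections import Counter


def majority_vote(answers):
    """Return the option chosen by a strict majority of runs, else None.

    Treating a run set without a strict majority as "no answer" is an
    assumed policy; the abstract does not specify tie handling.
    """
    if not answers:
        return None
    option, count = Counter(answers).most_common(1)[0]
    return option if count > len(answers) / 2 else None


# Hypothetical outputs of three runs on two questions.
agreed = majority_vote(["B", "B", "C"])
split = majority_vote(["A", "B", "C"])
print(agreed, split)
```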

3.3 · IV · Apr 5
Non-Destructive Prediction of Fruit Ripeness and Firmness Using Hyperspectral Imaging and Lightweight Machine Learning Models

Phongsakon Mark Konrad, Casper Kunstmann-Olsen, Jacek Fiutowski et al.

Post-harvest fruit quality assessment is essential for reducing food waste, yet reliable non-destructive methods typically depend on expensive hyperspectral cameras and computationally intensive deep learning models. These systems typically require GPU resources, large-scale training data, and domain expertise, limiting their feasibility for many real-world agricultural settings. This study systematically evaluates 20 classical machine learning algorithms on hyperspectral imaging data for simultaneous ripeness classification and firmness prediction across five fruit species, using a cross-validated experimental design with Bayesian hyperparameter optimization. Data preprocessing strategy, particularly class balancing and spectral transformations, contributes as much to prediction accuracy as algorithm choice. Our results show that tree-based machine learning models can outperform state-of-the-art deep learning models reported in Fruit-HSNet. Moreover, the findings indicate that only three visible-range wavelengths are needed to recover over 94% of full-spectrum accuracy, demonstrating that low-cost multispectral sensors combined with lightweight machine learning models can serve as practical alternatives to expensive hyperspectral cameras and complex deep learning approaches for fruit quality sorting.

CL · Dec 4, 2023
Developing Linguistic Patterns to Mitigate Inherent Human Bias in Offensive Language Detection

Toygar Tanyel, Besher Alkurdi, Serkan Ayvaz

With the proliferation of social media, there has been a sharp increase in offensive content, particularly targeting vulnerable groups, exacerbating social problems such as hatred, racism, and sexism. Detecting offensive language use is crucial to prevent offensive language from being widely shared on social media. However, the accurate detection of irony, implication, and various forms of hate speech on social media remains a challenge. Natural language-based deep learning models require extensive training with large, comprehensive, and labeled datasets. Unfortunately, manually creating such datasets is both costly and error-prone. Additionally, the presence of human bias in offensive language datasets is a major concern for deep learning models. In this paper, we propose a linguistic data augmentation approach to reduce bias in labeling processes, which aims to mitigate the influence of human bias by leveraging the power of machines to improve the accuracy and fairness of labeling processes. This approach has the potential to improve offensive language classification tasks across multiple languages and reduce the prevalence of offensive content on social media.

CV · Sep 7, 2025
Challenges in Deep Learning-Based Small Organ Segmentation: A Benchmarking Perspective for Medical Research with Limited Datasets

Phongsakon Mark Konrad, Andrei-Alexandru Popa, Yaser Sabzehmeidani et al.

Accurate segmentation of carotid artery structures in histopathological images is vital for advancing cardiovascular disease research and diagnosis. However, deep learning model development in this domain is constrained by the scarcity of annotated cardiovascular histopathological data. This study presents a systematic evaluation of state-of-the-art deep learning segmentation models, including convolutional neural networks (U-Net, DeepLabV3+), a Vision Transformer (SegFormer), and recent foundation models (SAM, MedSAM, MedSAM+UNet), on a limited dataset of cardiovascular histology images. Despite employing an extensive hyperparameter optimization strategy with Bayesian search, our findings reveal that model performance is highly sensitive to data splits, with minor differences driven more by statistical noise than by true algorithmic superiority. This instability exposes the limitations of standard benchmarking practices in low-data clinical settings and challenges the assumption that performance rankings reflect meaningful clinical utility.

LG · Mar 19, 2024
A Big Data Analytics System for Predicting Suicidal Ideation in Real-Time Based on Social Media Streaming Data

Mohamed A. Allayla, Serkan Ayvaz

Online social media platforms have recently become integral to our society and daily routines. Every day, users worldwide spend a couple of hours on such platforms, expressing their sentiments and emotional state and contacting each other. Analyzing such huge amounts of data from these platforms can provide a clear insight into public sentiments and help detect users' mental status. The early identification of these health condition risks may assist in preventing or reducing suicidal ideation and potentially saving people's lives. Traditional techniques have become ineffective in processing such streams and large-scale datasets. Therefore, the paper proposes a new methodology based on a big data architecture to predict suicidal ideation from social media content. The proposed approach provides a practical analysis of social media data in two phases: batch processing and real-time streaming prediction. The batch dataset was collected from the Reddit forum and used for model building and training, while streaming big data was extracted using the Twitter streaming API and used for real-time prediction. After the raw data was preprocessed, the extracted features were fed to multiple Apache Spark ML classifiers: NB, LR, LinearSVC, DT, RF, and MLP. We conducted experiments using various feature-extraction techniques with different testing scenarios. The experimental results of the batch processing phase showed that the (Unigram + Bigram) + CV-IDF features with the MLP classifier provided high performance for classifying suicidal ideation, with an accuracy of 93.47%; this configuration was then applied in the real-time streaming prediction phase.
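
To make the "(Unigram + Bigram) + CV-IDF" feature combination more concrete, the sketch below builds count vectors over unigrams and bigrams and reweights them by inverse document frequency. It is a plain-Python illustration, not the Spark ML pipeline used in the paper: the function names, the whitespace tokenizer, the sample sentences, and the smoothed IDF form log((m + 1) / (df + 1)) are all assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_terms(text):
    """Unigrams plus bigrams from a naive whitespace tokenization."""
    tokens = text.lower().split()
    return tokens + ngrams(tokens, 2)


def fit_idf(term_lists):
    """Smoothed IDF per term: log((m + 1) / (df + 1)) (assumed form)."""
    m = len(term_lists)
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))  # count each term once per document
    return {t: math.log((m + 1) / (df + 1)) for t, df in doc_freq.items()}


def transform(terms, idf):
    """Count-vectorize one document and scale each count by its IDF."""
    counts = Counter(terms)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


# Hypothetical two-document corpus.
docs = ["i feel fine", "i feel alone"]
term_lists = [extract_terms(d) for d in docs]
idf = fit_idf(term_lists)
features = [transform(terms, idf) for terms in term_lists]
print(features[0])
```

Terms shared by every document (here "i", "feel", "i feel") get weight zero, so only the distinguishing unigrams and bigrams carry signal to the classifier.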

CR · Jul 9, 2019
Private key encryption and recovery in blockchain

Mehmet Aydar, Salih Cemil Cetin, Serkan Ayvaz et al.

The disruptive technology of blockchain can deliver secure solutions without the need for a central authority. In blockchain protocols, assets that belong to a participant are controlled through the private key of an asymmetric key pair that is owned by the participant. Although this gives blockchain network participants sovereignty over their assets, it comes with the responsibility of managing their own keys. Currently, there exist two major bottlenecks in managing keys: (a) users don't have an efficient and secure way to store their keys, and (b) no efficient recovery mechanism exists in case the keys are lost. In this study, we propose secure methods to efficiently store and recover keys. For the former, we introduce an efficient encryption mechanism to securely encrypt and decrypt the private key using the owner's biometric signature. For the latter, we introduce an efficient recovery mechanism using biometrics and a secret sharing scheme. By applying the proposed key encryption and recovery mechanism, asset owners are able to securely store their keys on their devices and recover the keys in case they are lost.
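
The abstract does not name the secret sharing scheme it relies on. As an illustration of the general idea, the sketch below uses Shamir's (t, n) threshold scheme over a prime field; that choice, the field size, the function names, and the toy secret are all assumptions, and the biometric component of the paper is not modeled. This is a teaching sketch, not production cryptography.

```python
import secrets

# Assumed field modulus: the Mersenne prime 2**127 - 1. A real 256-bit
# private key would need a larger prime or be split into chunks.
PRIME = 2**127 - 1


def split_secret(secret, n, t):
    """Split `secret` into n shares; any t of them can recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    # Random polynomial of degree t - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange-interpolate the polynomial at x = 0 from the shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


key = 123456789  # toy stand-in for a private key
shares = split_secret(key, n=5, t=3)
print(recover_secret(shares[:3]) == key)
```

Any three of the five shares reconstruct the key, while fewer than three reveal nothing about it, which is the property that makes such a scheme attractive for key recovery.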

CR · Jun 24, 2019
Towards a Blockchain based digital identity verification, record attestation and record sharing system

Mehmet Aydar, Serkan Ayvaz, Salih Cemil Cetin

The Covid-19 pandemic has made individuals and organizations rethink how they handle identity verification and credential sharing, particularly in quarantine situations. In this study, we investigate the inefficiencies of traditional identity systems, and discuss how a proper implementation of Blockchain technology would result in safer, more secure, privacy-respecting and remote-friendly identity systems. As a result, we propose a Blockchain-based framework for digital identity verification, record attestation and record sharing, and we explain the framework in detail with certain use cases. Our proposed framework enables individuals to fully control their identity data and govern the level of identity data sharing.

AI · Jul 12, 2017
Using RDF Summary Graph For Keyword-based Semantic Searches

Serkan Ayvaz, Mehmet Aydar

The Semantic Web began to emerge as its standards and technologies developed rapidly in recent years. The continuing development of Semantic Web technologies has facilitated publishing explicit semantics with data on the Web in the RDF data model. This study proposes a semantic search framework to support efficient keyword-based semantic search on RDF data utilizing near neighbor explorations. The framework augments the search results with the resources in close proximity by utilizing the entity type semantics. Along with the search results, the system generates a relevance confidence score measuring the inferred semantic relatedness of returned entities based on the degree of similarity. Furthermore, evaluations assessing the effectiveness of the framework and the accuracy of the results are presented.

DB · May 31, 2017
Dynamic Discovery of Type Classes and Relations in Semantic Web Data

Serkan Ayvaz, Mehmet Aydar

The continuing development of Semantic Web technologies and increasing user adoption in recent years have accelerated progress in incorporating explicit semantics with data on the Web. With the rapidly growing RDF (Resource Description Framework) data on the Semantic Web, processing large semantic graph data has become more challenging. Constructing a summary graph structure from the raw RDF can help obtain semantic type relations and reduce the computational complexity for graph processing purposes. In this paper, we address the problem of graph summarization in RDF graphs and propose an approach for building summary graph structures automatically from RDF graph data. Moreover, we introduce a measure to help discover optimum class dissimilarity thresholds and an effective method to discover the type classes automatically. In future work, we plan to investigate further improvement options on the scalability of the proposed method.