Taher A. Ghaleb

SE
h-index6
11papers
226citations
Novelty35%
AI Score53

11 Papers

31.7SEMar 18Code
MLmisFinder: A Specification and Detection Approach of Machine Learning Service Misuses

Hadil Ben Amor, Niruthiha Selvanayagam, Manel Abdellatif et al.

Machine Learning (ML) cloud services, offered by leading providers such as Amazon, Google, and Microsoft, enable the integration of ML components into software systems without building models from scratch. However, the rapid adoption of ML services, coupled with the growing complexity of business requirements, has led to widespread misuses, compromising the quality, maintainability, and evolution of ML service-based systems. Though prior research has studied patterns and antipatterns in service-based and ML-based systems separately, automatic detection of ML service misuses remains a challenge. In this paper, we propose MLmisFinder, an automatic approach to detect ML service misuses in software systems, aiming to identify instances of improper use of ML services to help developers properly integrate ML components in ML service-based systems. We propose a metamodel that captures the data needed to detect misuses in ML service-based systems and apply a set of rule-based detection algorithms for seven misuse types. We evaluated MLmisFinder on 107 software systems collected from open-source GitHub repositories and compared it with a state-of-the-art baseline. Our results show that MLmisFinder effectively detects ML service misuses, achieving an average precision of 96.7\% and recall of 97\%, outperforming the state-of-the-art baseline. MLmisFinder also scaled efficiently to detect misuses across 817 ML service-based systems and revealed that such misuses are widespread, especially in areas such as data drift monitoring and schema validation.

28.8SEApr 3Code
Android Instrumentation Testing in Continuous Integration: Practices, Patterns, and Performance

Hamid Parsazadeh, Taher A. Ghaleb, Safwat Hassan

Android instrumentation tests (end-to-end tests that run on a device or emulator) can catch problems that simpler tests miss. However, running these tests automatically in continuous integration (CI) is often difficult because emulator setup is fragile and configurations tend to drift over time. We study how open-source Android apps run instrumentation tests in CI by analyzing 4,518 repositories that use CI (snapshot: Aug. 10, 2025). We examine CI workflow files, scripts, and build configurations to identify cases where device setup is defined in Gradle (e.g., Gradle Managed Devices). Our results answer three questions about adoption, evolution, and outcomes. First, only about one in ten repositories (481/4,518; 10.6%) run instrumentation tests in CI, typically using either reusable community components or repository-specific custom scripts to set up emulators. Second, these setups usually stay the same over time; when changes happen, projects tend to move from custom scripts toward reusable community components. Third, we study why projects change their CI setup by analyzing their commits, pull requests, and issue messages. We evaluate how different setup styles perform using GitHub Actions run- and step-level metadata (e.g., outcomes, duration, reruns, and queue delay). We find that teams often change approaches to expand test coverage, and that each approach fits different needs: community-based setups are typically the most reliable and efficient for everyday checks on new code, third-party device labs suit scheduled regression testing but can be costlier and fail more often, and custom scripting provides flexibility but is associated with more reruns.

23.9SEApr 4Code
Mapping GitHub Sponsorships: A Longitudinal Observatory for Open-Source Sustainability

Rylan Hiltz, Taher A. Ghaleb

Financial sustainability is vital for open-source software, yet systematic research on funding remains limited. GitHub Sponsors, launched in 2019 as a direct developer-to-developer funding model, lacks bulk API access, hindering large-scale studies. This paper introduces a live, continuously operating observatory for tracking and analyzing the GitHub Sponsors ecosystem. The observatory performs priority-based graph traversal with daily incremental updates, real-time normalization, and exposes collected data through an interactive dashboard and analysis-ready CSV exports. A sample dataset collected during a 72-hour run captures 49K+ users across 144 countries and serves as an example of the tool's output, not a fixed deliverable. An interactive dashboard (https://github-sponsorships.com) enables practitioners and researchers to explore sponsorship patterns, filter by geography and demographics, and benchmark against funded peers. Preliminary results on the sample show strong participation asymmetries and geographic concentration, suggesting several research directions.

SEJan 28Code
LogSieve: Task-Aware CI Log Reduction for Sustainable LLM-Based Analysis

Marcus Emmanuel Barnes, Taher A. Ghaleb, Safwat Hassan

Logs are essential for understanding Continuous Integration (CI) behavior, particularly for diagnosing build failures and performance regressions. Yet their growing volume and verbosity make both manual inspection and automated analysis increasingly costly, time-consuming, and environmentally costly. While prior work has explored log compression, anomaly detection, and LLM-based log analysis, most efforts target structured system logs rather than the unstructured, noisy, and verbose logs typical of CI workflows. We present LogSieve, a lightweight, RCA-aware and semantics-preserving log reduction technique that filters low-information lines while retaining content relevant to downstream reasoning. Evaluated on CI logs from 20 open-source Android projects using GitHub Actions, LogSieve achieves an average 42% reduction in lines and 40% reduction in tokens with minimal semantic loss. This pre-inference reduction lowers computational cost and can proportionally reduce energy use (and associated emissions) by decreasing the volume of data processed during LLM inference. Compared with structure-first baselines (LogZip and random-line removal), LogSieve preserves much higher semantic and categorical fidelity (Cosine = 0.93, GPTScore = 0.93, 80% exact-match accuracy). Embedding-based classifiers automate relevance detection with near-human accuracy (97%), enabling scalable and sustainable integration of semantics-aware filtering into CI workflows. LogSieve thus bridges log management and LLM reasoning, offering a practical path toward greener and more interpretable CI automation.

19.3SEApr 3
A Vision for Context-Aware CI Adoption Decisions

Osamah H. Alaini, Taher A. Ghaleb

Continuous Integration (CI) is widely adopted in modern software development, yet adoption decisions are often made without systematic consideration of project context. Platforms such as GitHub Actions lower the barrier to CI adoption but provide limited support for grounding adoption decisions in project characteristics, leading to redundant services, unmaintained workflows, and costly migrations. Existing research and tooling primarily focus on improving CI after adoption, offering little guidance for assessing suitability before adoption. As a result, CI is frequently treated as universally beneficial rather than context-dependent. This paper envisions a shift from default CI adoption to deliberate, context-aware decision-making. We propose an AI-enabled framework that assesses whether projects are likely to benefit from CI, recommends suitable CI services based on project characteristics, and provides configuration guidance tailored to project needs. We outline a research agenda combining developer studies, large-scale repository mining, and recommendation system design to enable informed CI adoption decisions and prevent inefficiencies before they occur.

13.6SEMar 19
The Promise and Reality of Continuous Integration Caching: An Empirical Study of Travis CI Builds

Taher A. Ghaleb, Daniel Alencar da Costa, Ying Zou

Continuous Integration (CI) provides early feedback by automatically building software, but long build durations can hinder developer productivity. CI services use caching to speed up builds by reusing infrequently changing artifacts, yet little is known about how caching is adopted in practice and what challenges it entails. In this paper, we conduct a large-scale empirical study of CI caching in Travis CI, analyzing 513,384 builds from 1,279 GitHub projects. We find that only 30% of projects adopt CI caching, and early adopters are typically more mature, with more dependencies, commits, and longer CI lifespans. To understand non-adoption, we submit pull requests enabling caching in non-adopting projects, and nearly half are accepted or merged. Developer feedback indicates that non- or late adoption mainly results from limited awareness of CI caching support. We further study cache maintenance and identify five common activities, performed by 24% of cache-enabled projects. While one-third of projects see substantial build-time reductions, cache uploads occur in 97% of builds, and 27% of projects contain stale cached artifacts. An analysis of reported caching issues shows developers mainly struggle with corrupted or outdated caches and request broader caching features. Overall, CI caching does not benefit all projects, requires ongoing maintenance, and is more complex in practice than many developers expect.

53.0SEMay 8
From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines

Marcus Emmanuel Barnes, Taher A. Ghaleb, Safwat Hassan

AI agents are assuming active roles in Continuous Integration and Continuous Deployment (CI/CD) workflows, yet the research community lacks a shared vocabulary for describing what it means for CI/CD to be agentic, how much decision authority is delegated, and where control should reside. This paper presents a vision of agentic CI/CD in which the central challenge is not improving task performance but designing authority transfer, defined as the delegation of operational decisions from human-controlled pipelines to agent systems under specified constraints and recourse mechanisms. To structure this argument, we introduce a distinction between data-plane authority (localized interventions such as patch generation and test reruns) and control-plane authority (modifications to pipeline configuration, deployment policies, and approval gates). Drawing on research prototypes and industrial platforms, we show that current systems operate mainly at the data plane under bounded autonomy, with safety achieved through surrounding governance infrastructure rather than intrinsic agent guarantees. We identify three recurring patterns: constrained autonomy as the dominant design, external governance as the primary safety mechanism, and a widening gap between deployment momentum and evaluation methodology. We propose a research agenda in which control-plane safety and governance mechanisms represent the most urgent open problem, followed by formalization of autonomy boundaries, evaluation frameworks, and human--agent coordination.

63.3SEMay 3
How Compliant Are GitHub Actions Workflows? A Checklist-Based Study with LLM-Assisted Auditing

Edward Abrokwah, Taher A. Ghaleb

GitHub Actions (GHA) CI workflows are critical infrastructure, but current tooling offers only syntactic or heuristic checks and does not enforce documented best practices for security, maintainability, or performance. Consequently, issues like over-privileged permissions, weak secrets management, and missing failure notifications remain undetected in real-world pipelines. This paper proposes a novel, documentation-grounded GHA compliance checklist with 30 criteria spanning four workflow sections and eight themes, and assesses Large Language Models (LLMs) for scalable compliance auditing. On 95 real-world Java workflows (2,850 assessments) using four open-weight LLMs, we find only fair agreement (Fleiss' kappa = 0.28), with systematic disagreement on structural reasoning and security-sensitive judgments. To address this, we introduce a multi-tier adjudication framework in which GPT 5 resolves model conflicts before targeted manual review, reducing verification effort by 81% while retaining 87% agreement with expert judgment. At scale, it reveals major compliance gaps: overall compliance is 28%, dropping to 4% for permission controls; Security (26%) lags far behind Clarity (68%). Our results show that LLMs enable scalable compliance measurement but cannot replace experts, highlighting the need for hybrid human-AI auditing and providing empirical benchmarks and guidance for defensible GHA workflow audits.

SEOct 13, 2025
Task-Aware Reduction for Scalable LLM-Database Systems

Marcus Emmanuel Barnes, Taher A. Ghaleb, Safwat Hassan · utoronto

Large Language Models (LLMs) are increasingly applied to data-intensive workflows, from database querying to developer observability. Yet the effectiveness of these systems is constrained by the volume, verbosity, and noise of real-world text-rich data such as logs, telemetry, and monitoring streams. Feeding such data directly into LLMs is costly, environmentally unsustainable, and often misaligned with task objectives. Parallel efforts in LLM efficiency have focused on model- or architecture-level optimizations, but the challenge of reducing upstream input verbosity remains underexplored. In this paper, we argue for treating the token budget of an LLM as an attention budget and elevating task-aware text reduction as a first-class design principle for language -- data systems. We position input-side reduction not as compression, but as attention allocation: prioritizing information most relevant to downstream tasks. We outline open research challenges for building benchmarks, designing adaptive reduction pipelines, and integrating token-budget--aware preprocessing into database and retrieval systems. Our vision is to channel scarce attention resources toward meaningful signals in noisy, data-intensive workflows, enabling scalable, accurate, and sustainable LLM--data integration.

SEDec 23, 2021
Flakify: A Black-Box, Language Model-based Predictor for Flaky Tests

Sakina Fatima, Taher A. Ghaleb, Lionel Briand

Software testing assures that code changes do not adversely affect existing functionality. However, a test case can be flaky, i.e., passing and failing across executions, even for the same version of the source code. Flaky test cases introduce overhead to software development as they can lead to unnecessary attempts to debug production or testing code. The state-of-the-art ML-based flaky test case predictors rely on pre-defined sets of features that are either project-specific, require access to production code, which is not always available to software test engineers. Therefore, in this paper, we propose Flakify, a black-box, language model-based predictor for flaky test cases. Flakify relies exclusively on the source code of test cases, thus not requiring to (a) access to production code (black-box), (b) rerun test cases, (c) pre-define features. To this end, we employed CodeBERT, a pre-trained language model, and fine-tuned it to predict flaky test cases using the source code of test cases. We evaluated Flakify on two publicly available datasets (FlakeFlagger and IDoFT) for flaky test cases and compared our technique with the FlakeFlagger approach using two different evaluation procedures: cross-validation and per-project validation. Flakify achieved high F1-scores on both datasets using cross-validation and per-project validation, and surpassed FlakeFlagger by 10 and 18 percentage points in terms of precision and recall, respectively, when evaluated on the FlakeFlagger dataset, thus reducing the cost bound to be wasted on unnecessarily debugging test cases and production code by the same percentages. Flakify also achieved significantly higher prediction results when used to predict test cases on new projects, suggesting better generalizability over FlakeFlagger. Our results further show that a black-box version of FlakeFlagger is not a viable option for predicting flaky test cases.

SEJun 25, 2021
Test Case Selection and Prioritization Using Machine Learning: A Systematic Literature Review

Rongqi Pan, Mojtaba Bagherzadeh, Taher A. Ghaleb et al.

Regression testing is an essential activity to assure that software code changes do not adversely affect existing functionalities. With the wide adoption of Continuous Integration (CI) in software projects, which increases the frequency of running software builds, running all tests can be time-consuming and resource-intensive. To alleviate that problem, Test case Selection and Prioritization (TSP) techniques have been proposed to improve regression testing by selecting and prioritizing test cases in order to provide early feedback to developers. In recent years, researchers have relied on Machine Learning (ML) techniques to achieve effective TSP (ML-based TSP). Such techniques help combine information about test cases, from partial and imperfect sources, into accurate prediction models. This work conducts a systematic literature review focused on ML-based TSP techniques, aiming to perform an in-depth analysis of the state of the art, thus gaining insights regarding future avenues of research. To that end, we analyze 29 primary studies published from 2006 to 2020, which have been identified through a systematic and documented process. This paper addresses five research questions addressing variations in ML-based TSP techniques and feature sets for training and testing ML models, alternative metrics used for evaluating the techniques, the performance of techniques, and the reproducibility of the published studies. We summarize the results related to our research questions in a high-level summary that can be used as a taxonomy for classifying future TSP studies.