Fareed Zaffar

CL
h-index: 33
6 papers
21 citations
Novelty: 40%
AI Score: 40

6 Papers

19.6 · SE · Apr 21
Revisiting Code Debloating with Ground Truth-based Evaluation

Muhammad Bilal, Moiz Ali, Mohit Kumar et al.

Program debloating aims to remove unused code to reduce performance overhead, attack surfaces, and maintenance costs. Over time, debloating has evolved across multiple layers (container, library, and application), each building on the principles of application-level debloating. Despite its central role, application-level debloating continues to rely on imperfect proxies for measuring performance, such as test-case-driven evaluation for correctness, code size for runtime efficiency, and gadget count reduction for estimating security posture. While there is widespread skepticism about using such imperfect proxies, the community still lacks standardized methodologies or benchmarks to assess the true performance of application-level software debloating. This experience paper aims to address the gap. We revisit the foundations of application-level debloating through a ground-truth-based evaluation paradigm. Our analysis of eight state-of-the-art debloaters - Blade, Chisel, Cov, CovA, Lmcas, Trimmer, Occam, and Razor - uncovers insights previously unattainable through traditional evaluations. These tools collectively span the spectrum of source-to-source, IR-to-IR, and binary-to-binary transformation paradigms, characterizing a holistic reassessment across abstraction levels. Our analysis reveals that while dynamic analysis-based tools often remove up to 94% of code that should be retained, static analysis-based approaches exhibit the opposite behavior, showing high false retention rates due to coarse-grained dependency over-approximation. Additionally, static analyses may add code by introducing specialized variants of functions. False retentions and removals not only cause functional incorrectness but may also lead to systematic inconsistency, robustness failures, and exploitable vulnerabilities.

CL · Sep 18, 2024
Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning

Essa Jan, Nouar AlDahoul, Moiz Ali et al.

Recent breakthroughs in Large Language Models (LLMs) have led to their adoption across a wide range of tasks, from code generation to machine translation and sentiment analysis. Red-teaming and safety-alignment efforts show that fine-tuning models on benign (non-harmful) data could compromise safety. However, it remains unclear to what extent this phenomenon is influenced by different variables, such as the fine-tuning task and model calibrations. This paper explores the task-wise safety degradation due to fine-tuning on downstream tasks such as summarization, code generation, translation, and classification across various calibrations. Our results reveal that: 1) Fine-tuning LLMs for code generation and translation leads to the highest degradation in safety guardrails. 2) LLMs generally have weaker guardrails for translation and classification, with 73-92% of harmful prompts answered, across baseline and other calibrations, falling into one of two concern categories. 3) Current solutions, including guards and safety tuning datasets, lack cross-task robustness. To address these issues, we developed a new multitask safety dataset that effectively reduces attack success rates across a range of tasks without compromising the model's overall helpfulness. Our work underscores the need for generalized alignment measures to ensure safer and more robust models.

CY · Oct 29, 2024 · Code
A Longitudinal Analysis of Racial and Gender Bias in New York Times and Fox News Images and Articles

Hazem Ibrahim, Nouar AlDahoul, Syed Mustafa Ali Abbasi et al.

The manner in which different racial and gender groups are portrayed in news coverage plays a large role in shaping public opinion. As such, understanding how such groups are portrayed in news media is of notable societal value, and has thus been a significant endeavour in both the computer and social sciences. Yet, the literature still lacks a longitudinal study examining both the frequency of appearance of different racial and gender groups in online news articles, as well as the context in which such groups are discussed. To fill this gap, we propose two machine learning classifiers to detect the race and age of a given subject. Next, we compile a dataset of 123,337 images and 441,321 online news articles from New York Times (NYT) and Fox News (Fox), and examine representation through two computational approaches. Firstly, we examine the frequency and prominence of appearance of racial and gender groups in images embedded in news articles, revealing that racial and gender minorities are largely under-represented, and when they do appear, they are featured less prominently compared to majority groups. Furthermore, we find that NYT largely features more images of racial minority groups compared to Fox. Secondly, we examine both the frequency and context with which racial minority groups are presented in article text. This reveals the narrow scope in which certain racial groups are covered and the frequency with which different groups are presented as victims and/or perpetrators in a given conflict. Taken together, our analysis contributes to the literature by providing two novel open-source classifiers to detect race and age from images, and shedding light on the racial and gender biases in news articles from venues on opposite ends of the American political spectrum.

CL · May 22, 2025
Data Doping or True Intelligence? Evaluating the Transferability of Injected Knowledge in LLMs

Essa Jan, Moiz Ali, Muhammad Saram Hassan et al.

As the knowledge of large language models (LLMs) becomes outdated over time, there is a growing need for efficient methods to update them, especially when injecting proprietary information. Our study reveals that comprehension-intensive fine-tuning tasks (e.g., question answering and blanks) achieve substantially higher knowledge retention rates (48%) compared to mapping-oriented tasks like translation (17%) or text-to-JSON conversion (20%), despite exposure to identical factual content. We demonstrate that this pattern persists across model architectures and follows scaling laws, with larger models showing improved retention across all task types. However, all models exhibit significant performance drops when applying injected knowledge in broader contexts, suggesting limited semantic integration. These findings highlight the importance of task selection in updating LLM knowledge: effective knowledge injection relies not just on data exposure but on the depth of cognitive engagement during fine-tuning.

LG · Sep 15, 2021
Avengers Ensemble! Improving Transferability of Authorship Obfuscation

Muhammad Haroon, Fareed Zaffar, Padmini Srinivasan et al.

Stylometric approaches have been shown to be quite effective for real-world authorship attribution. To mitigate the privacy threat posed by authorship attribution, researchers have proposed automated authorship obfuscation approaches that aim to conceal the stylometric artefacts that give away the identity of an anonymous document's author. Recent work has focused on authorship obfuscation approaches that rely on black-box access to an attribution classifier to evade attribution while preserving semantics. However, to be useful under a realistic threat model, it is important that these obfuscation approaches work well even when the adversary's attribution classifier is different from the one used internally by the obfuscator. Unfortunately, existing authorship obfuscation approaches do not transfer well to unseen attribution classifiers. In this paper, we propose an ensemble-based approach for transferable authorship obfuscation. Our experiments show that if an obfuscator can evade an ensemble attribution classifier, which is based on multiple base attribution classifiers, it is more likely to transfer to different attribution classifiers. Our analysis shows that ensemble-based authorship obfuscation achieves better transferability because it combines the knowledge from each of the base attribution classifiers by essentially averaging their decision boundaries.

CY · May 7, 2015
Characterizing Key Stakeholders in an Online Black-Hat Marketplace

Shehroze Farooqi, Muhammad Ikram, Emiliano De Cristofaro et al.

Over the past few years, many black-hat marketplaces have emerged that facilitate access to reputation manipulation services such as fake Facebook likes, fraudulent search engine optimization (SEO), or bogus Amazon reviews. In order to deploy effective technical and legal countermeasures, it is important to understand how these black-hat marketplaces operate, shedding light on the services they offer, who is selling, who is buying, what they are buying, who is more successful, and why they are successful. Toward this goal, in this paper, we present a detailed micro-economic analysis of a popular online black-hat marketplace, namely, SEOClerks.com. As the site provides non-anonymized transaction information, we set out to analyze the selling and buying behavior of individual users, propose a strategy to identify key users, and study their tactics as compared to other (non-key) users. We find that key users: (1) are mostly located in Asian countries, (2) are focused more on selling black-hat SEO services, (3) tend to list more lower-priced services, and (4) sometimes buy services from other sellers and then sell at higher prices. Finally, we discuss the implications of our analysis with respect to devising effective economic and legal intervention strategies against marketplace operators and key users.