Zubair Shafiq

h-index21

22papers

2,702citations

Novelty51%

AI Score61

Ranked #2,096 of 194,257 authors (top 1%)#25 in CR (top 1%)

22 Papers

7.4CRApr 8Code

Understanding Data Collection, Brokerage, and Spam in the Lead Marketing Ecosystem

Yash Vekaria, Nurullah Demir, Konrad Kollnig et al.

The lead marketing ecosystem enables collection, sale, and use of personal data submitted via web forms to deliver personalized quotes in high-value verticals such as insurance. Despite its scale and sensitivity of the collected data, this ecosystem remains largely unexplored by the research community. We present the first empirical study of privacy and spam risks in lead marketing, developing an end-to-end measurement framework to trace data flows from data collection to consumer contact. Our setup instruments over 100 health-related lead-generation websites and monitors 200 controlled phone numbers and email addresses to understand downstream marketing practices. We observe sharing of highly personal and sensitive health information to more than 70 distinct third parties on these lead generation websites. By purchasing our own and other organic leads from three major lead platforms, we uncover deceptive brokerage practices, where consumer data is sold to unvetted buyers and often augmented or fabricated with attributes such as health status and weight. We received a total of over 8,000 telemarketing phone calls, 600 text messages, and 200 emails, where calls often began within seconds of form submission. Many campaigns relied on VoIP-based neighbor spoofing and high-frequency dialing, at times rendering phones unusable. Our experiments with phone and email opt-outs suggest phone-based opt-outs to help the most, although all were ineffective at completely stopping marketing communications. Analysis of 7,432 Better Business Bureau (BBB) complaints and reviews corroborates these findings from the consumer perspective. Overall, our results reveal a highly interconnected and non-compliant lead marketing ecosystem that aggressively monetizes sensitive consumer data.

8.3CRApr 30Code

SST-Guard: Detecting and Characterizing Server-Side Google Analytics in the Wild

Muhammad Jazlan, Alexander Gamero-Garrido, Zubair Shafiq et al.

As web browsers increasingly restrict client-side tracking, the web tracking ecosystem is shifting from client-side to server-side tracking (SST). In SST, the browser sends tracking requests to an intermediate endpoint, which then forwards them to the tracker's endpoint, eliminating direct client-to-tracker requests. As a result, existing tracking protections that block requests to known tracker endpoints are rendered ineffective. In this paper, we investigate server-side implementation of Google Analytics, the most widely deployed third-party tracking service on the web today. We also present SST-Guard, a multi-modal, browser-based system for detecting and blocking server-side Google Analytics (sGA). Our key insight is that even when the tracker's endpoints change, sGA must necessarily still collect and share the same semantic information as client-side Google Analytics (e.g., identifiers, event metadata). Therefore, rather than detecting requests to known Google Analytics endpoints, SST-Guard aims to detect underlying artifacts of collection and sharing of these semantic values to any arbitrary endpoint. Operationalizing this insight is challenging because real-world sGA deployments commonly customize endpoints and obfuscate URLs/payloads. SST-Guard addresses this challenge using a value-template approach that employs regular expressions to match semantic value patterns across multiple modalities: network requests, cookies, and the window object. We validate SST-Guard on Tranco top-10k websites, detecting 4.02\% (403) sGA domains with over 93\% accuracy across three modalities, with network request classifier demonstrating the highest accuracy (99.8\%). By deploying SST-Guard in the wild, we find 4.21\% (6,314) of Tranco top-150k websites using sGA.

32.0CLMar 21, 2022Code

On The Robustness of Offensive Language Classifiers

Jonathan Rusert, Zubair Shafiq, Padmini Srinivasan

Social media platforms are deploying machine learning based offensive language classification systems to combat hateful, racist, and other forms of offensive speech at scale. However, despite their real-world deployment, we do not yet comprehensively understand the extent to which offensive language classifiers are robust against adversarial attacks. Prior work in this space is limited to studying robustness of offensive language classifiers against primitive attacks such as misspellings and extraneous spaces. To address this gap, we systematically analyze the robustness of state-of-the-art offensive language classifiers against more crafty adversarial attacks that leverage greedy- and attention-based word selection and context-aware embeddings for word replacement. Our results on multiple datasets show that these crafty adversarial attacks can degrade the accuracy of offensive language classifiers by more than 50% while also being able to preserve the readability and meaning of the modified text.

7.1CRAug 7, 2023Code

PURL: Safe and Effective Sanitization of Link Decoration

Shaoor Munir, Patrick Lee, Umar Iqbal et al.

While privacy-focused browsers have taken steps to block third-party cookies and mitigate browser fingerprinting, novel tracking techniques that can bypass existing countermeasures continue to emerge. Since trackers need to share information from the client-side to the server-side through link decoration regardless of the tracking technique they employ, a promising orthogonal approach is to detect and sanitize tracking information in decorated links. To this end, we present PURL (pronounced purel-l), a machine-learning approach that leverages a cross-layer graph representation of webpage execution to safely and effectively sanitize link decoration. Our evaluation shows that PURL significantly outperforms existing countermeasures in terms of accuracy and reducing website breakage while being robust to common evasion techniques. PURL's deployment on a sample of top-million websites shows that link decoration is abused for tracking on nearly three-quarters of the websites, often to share cookies, email addresses, and fingerprinting information.

1.4CLMar 22, 2022Code

A Girl Has A Name, And It's ... Adversarial Authorship Attribution for Deobfuscation

Wanyue Zhai, Jonathan Rusert, Zubair Shafiq et al.

Recent advances in natural language processing have enabled powerful privacy-invasive authorship attribution. To counter authorship attribution, researchers have proposed a variety of rule-based and learning-based text obfuscation approaches. However, existing authorship obfuscation approaches do not consider the adversarial threat model. Specifically, they are not evaluated against adversarially trained authorship attributors that are aware of potential obfuscation. To fill this gap, we investigate the problem of adversarial authorship attribution for deobfuscation. We show that adversarially trained authorship attributors are able to degrade the effectiveness of existing obfuscators from 20-30% to 5-10%. We also evaluate the effectiveness of adversarial training when the attributor makes incorrect assumptions about whether and which obfuscator was used. While there is a a clear degradation in attribution accuracy, it is noteworthy that this degradation is still at or above the attribution accuracy of the attributor that is not adversarially trained at all. Our results underline the need for stronger obfuscation approaches that are resistant to deobfuscation

5.3LGAug 16, 2023

Benchmarking Adversarial Robustness of Compressed Deep Learning Models

Brijesh Vora, Kartik Patwari, Syed Mahbub Hafiz et al.

The increasing size of Deep Neural Networks (DNNs) poses a pressing need for model compression, particularly when employed on resource constrained devices. Concurrently, the susceptibility of DNNs to adversarial attacks presents another significant hurdle. Despite substantial research on both model compression and adversarial robustness, their joint examination remains underexplored. Our study bridges this gap, seeking to understand the effect of adversarial inputs crafted for base models on their pruned versions. To examine this relationship, we have developed a comprehensive benchmark across diverse adversarial attacks and popular DNN models. We uniquely focus on models not previously exposed to adversarial training and apply pruning schemes optimized for accuracy and performance. Our findings reveal that while the benefits of pruning enhanced generalizability, compression, and faster inference times are preserved, adversarial robustness remains comparable to the base model. This suggests that model compression while offering its unique advantages, does not undermine adversarial robustness.

7.2CRApr 2

Towards Multi-Stakeholder Vulnerability Notifications in the Ad-Tech Supply Chain

Yash Vekaria, Rishab Nithyanand, Zubair Shafiq

Online advertising relies on a complex and opaque supply chain that involves multiple stakeholders, including advertisers, publishers, and ad-networks, each with distinct and sometimes conflicting incentives. Recent research has demonstrated the existence of ad-tech supply chain vulnerabilities such as dark pooling, where low-quality publishers bundle their ad inventory with higher-quality ones to mislead advertisers. We investigate the effectiveness of vulnerability notification campaigns aimed at mitigating dark pooling. Prior research on vulnerability notifications have primarily explored single-stakeholder contexts, leaving multi-stakeholder scenarios understudied. There is limited attention to complex multi-stakeholder supply chain ecosystems such as ad-tech supply chain, where resolving vulnerabilities often requires coordinated action across entities with misaligned incentives and interdependent roles. We address this gap by implementing the first online advertising supply chain vulnerability notification pipeline to systematically evaluate the responsiveness of various stakeholders in ad-tech supply chain, including publishers, ad-networks, and advertisers to vulnerability notifications by academics and activists. Our nine-month long automated multi-stakeholder notification study shows that notifications are an effective method for reducing dark pooling vulnerabilities in the online advertising ecosystem, especially when targeted towards ad-networks. Further, the sender reputation does not impact responses to notifications from activists and academics in a statistically different way. Overall, our research fosters industry-scale solution to combat ad inventory fraud and fosters future research on feasibility of multi-stakeholder vulnerability notifications in other supply chain ecosystems.

1.2CYAug 1, 2025

Catching Dark Signals in Algorithms: Unveiling Audiovisual and Thematic Markers of Unsafe Content Recommended for Children and Teenagers

Haoning Xue, Brian Nishimine, Martin Hilbert et al.

The prevalence of short form video platforms, combined with the ineffectiveness of age verification mechanisms, raises concerns about the potential harms facing children and teenagers in an algorithm-moderated online environment. We conducted multimodal feature analysis and thematic topic modeling of 4,492 short videos recommended to children and teenagers on Instagram Reels, TikTok, and YouTube Shorts, collected as a part of an algorithm auditing experiment. This feature-level and content-level analysis revealed that unsafe (i.e., problematic, mentally distressing) short videos (a) possess darker visual features and (b) contain explicitly harmful content and implicit harm from anxiety-inducing ordinary content. We introduce a useful framework of online harm (i.e., explicit, implicit, unintended), providing a unique lens for understanding the dynamic, multifaceted online risks facing children and teenagers. The findings highlight the importance of protecting younger audiences in critical developmental stages from both explicit and implicit risks on social media, calling for nuanced content moderation, age verification, and platform regulation.

6.6CRMar 10Code

PixelConfig: Longitudinal Measurement and Reverse-Engineering of Meta Pixel Configurations

Abdullah Ghani, Yash Vekaria, Zubair Shafiq

Tracking pixels are used to optimize online ad campaigns through personalization, re-targeting, and conversion tracking. Past research has primarily focused on detecting the prevalence of tracking pixels on the web, with limited attention to how they are configured across websites. A tracking pixel may be configured differently on different websites. In this paper, we present a differential analysis framework: PixelConfig, to reverse-engineer the configurations of Meta Pixel deployments across the web. Using this framework, we investigate three types of Meta Pixel configurations: activity tracking (i.e., what a user is doing on a website), identity tracking (i.e., who a user is or who the device is associated with), and tracking restrictions (i.e., mechanisms to limit the sharing of potentially sensitive information). Using data from the Internet Archive's Wayback Machine, we analyze and compare Meta Pixel configurations on 18K health-related websites with a control group of the top 10K websites from 2017 to 2024. We find that activity tracking features, such as automatic events that collect button clicks and page metadata, and identity tracking features, such as first-party cookies that are unaffected by third-party cookie blocking, reached adoption rates of up to 98.4%, largely driven by the Pixel's default settings. We also find that the Pixel is being used to track potentially sensitive information, such as user interactions related to booking medical appointments and button clicks associated with specific medical conditions (e.g., erectile dysfunction) on health-related websites. Tracking restriction features, such as Core Setup, are configured on up to 34.3% of health websites and 8.7% of control websites. However, even when enabled, these tracking restriction features provide limited protection and can be circumvented in practice.

13.1HCMar 20, 2025

Big Help or Big Brother? Auditing Tracking, Profiling, and Personalization in Generative AI Assistants

Yash Vekaria, Aurelio Loris Canino, Jonathan Levitsky et al.

Generative AI (GenAI) browser assistants integrate powerful capabilities of GenAI in web browsers to provide rich experiences such as question answering, content summarization, and agentic navigation. These assistants, available today as browser extensions, can not only track detailed browsing activity such as search and click data, but can also autonomously perform tasks such as filling forms, raising significant privacy concerns. It is crucial to understand the design and operation of GenAI browser extensions, including how they collect, store, process, and share user data. To this end, we study their ability to profile users and personalize their responses based on explicit or inferred demographic attributes and interests of users. We perform network traffic analysis and use a novel prompting framework to audit tracking, profiling, and personalization by the ten most popular GenAI browser assistant extensions. We find that instead of relying on local in-browser models, these assistants largely depend on server-side APIs, which can be auto-invoked without explicit user interaction. When invoked, they collect and share webpage content, often the full HTML DOM and sometimes even the user's form inputs, with their first-party servers. Some assistants also share identifiers and user prompts with third-party trackers such as Google Analytics. The collection and sharing continues even if a webpage contains sensitive information such as health or personal information such as name or SSN entered in a web form. We find that several GenAI browser assistants infer demographic attributes such as age, gender, income, and interests and use this profile--which carries across browsing contexts--to personalize responses. In summary, our work shows that GenAI browser assistants can and do collect personal and sensitive information for profiling and personalization with little to no safeguards.

4.1LGFeb 13, 2025

AutoLike: Auditing Social Media Recommendations through User Interactions

Hieu Le, Salma Elmalaki, Zubair Shafiq et al.

Modern social media platforms, such as TikTok, Facebook, and YouTube, rely on recommendation systems to personalize content for users based on user interactions with endless streams of content, such as "For You" pages. However, these complex algorithms can inadvertently deliver problematic content related to self-harm, mental health, and eating disorders. We introduce AutoLike, a framework to audit recommendation systems in social media platforms for topics of interest and their sentiments. To automate the process, we formulate the problem as a reinforcement learning problem. AutoLike drives the recommendation system to serve a particular type of content through interactions (e.g., liking). We apply the AutoLike framework to the TikTok platform as a case study. We evaluate how well AutoLike identifies TikTok content automatically across nine topics of interest; and conduct eight experiments to demonstrate how well it drives TikTok's recommendation system towards particular topics and sentiments. AutoLike has the potential to assist regulators in auditing recommendation systems for problematic content. (Warning: This paper contains qualitative examples that may be viewed as offensive or harmful.)

4.6LGFeb 25, 2022Code

AutoFR: Automated Filter Rule Generation for Adblocking

Hieu Le, Salma Elmalaki, Athina Markopoulou et al.

Adblocking relies on filter lists, which are manually curated and maintained by a community of filter list authors. Filter list curation is a laborious process that does not scale well to a large number of sites or over time. In this paper, we introduce AutoFR, a reinforcement learning framework to fully automate the process of filter rule creation and evaluation for sites of interest. We design an algorithm based on multi-arm bandits to generate filter rules that block ads while controlling the trade-off between blocking ads and avoiding visual breakage. We test AutoFR on thousands of sites and we show that it is efficient: it takes only a few minutes to generate filter rules for a site of interest. AutoFR is effective: it generates filter rules that can block 86% of the ads, as compared to 87% by EasyList, while achieving comparable visual breakage. Furthermore, AutoFR generates filter rules that generalize well to new sites. We envision that AutoFR can assist the adblocking community in filter rule generation at scale.

13.6CRDec 3, 2021

FP-Radar: Longitudinal Measurement and Early Detection of Browser Fingerprinting

Pouneh Nikkhah Bahrami, Umar Iqbal, Zubair Shafiq

Browser fingerprinting is a stateless tracking technique that attempts to combine information exposed by multiple different web APIs to create a unique identifier for tracking users across the web. Over the last decade, trackers have abused several existing and newly proposed web APIs to further enhance the browser fingerprint. Existing approaches are limited to detecting a specific fingerprinting technique(s) at a particular point in time. Thus, they are unable to systematically detect novel fingerprinting techniques that abuse different web APIs. In this paper, we propose FP-Radar, a machine learning approach that leverages longitudinal measurements of web API usage on top-100K websites over the last decade, for early detection of new and evolving browser fingerprinting techniques. The results show that FP-Radar is able to early detect the abuse of newly introduced properties of already known (e.g., WebGL, Sensor) and as well as previously unknown (e.g., Gamepad, Clipboard) APIs for browser fingerprinting. To the best of our knowledge, FP-Radar is also the first to detect the abuse of the Visibility API for ephemeral fingerprinting in the wild.

1.6LGNov 9, 2021

HARPO: Learning to Subvert Online Behavioral Advertising

Jiang Zhang, Konstantinos Psounis, Muhammad Haroon et al.

Online behavioral advertising, and the associated tracking paraphernalia, poses a real privacy threat. Unfortunately, existing privacy-enhancing tools are not always effective against online advertising and tracking. We propose Harpo, a principled learning-based approach to subvert online behavioral advertising through obfuscation. Harpo uses reinforcement learning to adaptively interleave real page visits with fake pages to distort a tracker's view of a user's browsing profile. We evaluate Harpo against real-world user profiling and ad targeting models used for online behavioral advertising. The results show that Harpo improves privacy by triggering more than 40% incorrect interest segments and 6x higher bid values. Harpo outperforms existing obfuscation tools by as much as 16x for the same overhead. Harpo is also able to achieve better stealthiness to adversarial detection than existing obfuscation tools. Harpo meaningfully advances the state-of-the-art in leveraging obfuscation to subvert online behavioral advertising

7.5LGSep 15, 2021

Avengers Ensemble! Improving Transferability of Authorship Obfuscation

Muhammad Haroon, Fareed Zaffar, Padmini Srinivasan et al.

Stylometric approaches have been shown to be quite effective for real-world authorship attribution. To mitigate the privacy threat posed by authorship attribution, researchers have proposed automated authorship obfuscation approaches that aim to conceal the stylometric artefacts that give away the identity of an anonymous document's author. Recent work has focused on authorship obfuscation approaches that rely on black-box access to an attribution classifier to evade attribution while preserving semantics. However, to be useful under a realistic threat model, it is important that these obfuscation approaches work well even when the adversary's attribution classifier is different from the one used internally by the obfuscator. Unfortunately, existing authorship obfuscation approaches do not transfer well to unseen attribution classifiers. In this paper, we propose an ensemble-based approach for transferable authorship obfuscation. Our experiments show that if an obfuscator can evade an ensemble attribution classifier, which is based on multiple base attribution classifiers, it is more likely to transfer to different attribution classifiers. Our analysis shows that ensemble-based authorship obfuscation achieves better transferability because it combines the knowledge from each of the base attribution classifiers by essentially averaging their decision boundaries.

13.6CRJul 23, 2021

WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking

Sandra Siby, Umar Iqbal, Steven Englehardt et al.

Millions of web users directly depend on ad and tracker blocking tools to protect their privacy. However, existing ad and tracker blockers fall short because of their reliance on trivially susceptible advertising and tracking content. In this paper, we first demonstrate that the state-of-the-art machine learning based ad and tracker blockers, such as AdGraph, are susceptible to adversarial evasions deployed in real-world. Second, we introduce WebGraph, the first graph-based machine learning blocker that detects ads and trackers based on their action rather than their content. By building features around the actions that are fundamental to advertising and tracking - storing an identifier in the browser, or sharing an identifier with another tracker - WebGraph performs nearly as well as prior approaches, but is significantly more robust to adversarial evasions. In particular, we show that WebGraph achieves comparable accuracy to AdGraph, while significantly decreasing the success rate of an adversary from near-perfect under AdGraph to around 8% under WebGraph. Finally, we show that WebGraph remains robust to a more sophisticated adversary that uses evasion techniques beyond those currently deployed on the web.

31.6CLJun 3, 2021Code

Fingerprinting Fine-tuned Language Models in the Wild

Nirav Diwan, Tanmoy Chakravorty, Zubair Shafiq

There are concerns that the ability of language models (LMs) to generate high quality synthetic text can be misused to launch spam, disinformation, or propaganda. Therefore, the research community is actively working on developing approaches to detect whether a given text is organic or synthetic. While this is a useful first step, it is important to be able to further fingerprint the author LM to attribute its origin. Prior work on fingerprinting LMs is limited to attributing synthetic text generated by a handful (usually < 10) of pre-trained LMs. However, LMs such as GPT2 are commonly fine-tuned in a myriad of ways (e.g., on a domain-specific text corpus) before being used to generate synthetic text. It is challenging to fingerprinting fine-tuned LMs because the universe of fine-tuned LMs is much larger in realistic scenarios. To address this challenge, we study the problem of large-scale fingerprinting of fine-tuned LMs in the wild. Using a real-world dataset of synthetic text generated by 108 different fine-tuned LMs, we conduct comprehensive experiments to demonstrate the limitations of existing fingerprinting approaches. Our results show that fine-tuning itself is the most effective in attributing the synthetic text generated by fine-tuned LMs.

10.6LGMay 12, 2021Code

Accuracy-Privacy Trade-off in Deep Ensemble: A Membership Inference Perspective

Shahbaz Rezaei, Zubair Shafiq, Xin Liu

Deep ensemble learning has been shown to improve accuracy by training multiple neural networks and averaging their outputs. Ensemble learning has also been suggested to defend against membership inference attacks that undermine privacy. In this paper, we empirically demonstrate a trade-off between these two goals, namely accuracy and privacy (in terms of membership inference attacks), in deep ensembles. Using a wide range of datasets and model architectures, we show that the effectiveness of membership inference attacks increases when ensembling improves accuracy. We analyze the impact of various factors in deep ensembles and demonstrate the root cause of the trade-off. Then, we evaluate common defenses against membership inference attacks based on regularization and differential privacy. We show that while these defenses can mitigate the effectiveness of membership inference attacks, they simultaneously degrade ensemble accuracy. We illustrate similar trade-off in more advanced and state-of-the-art ensembling techniques, such as snapshot ensembles and diversified ensemble networks. Finally, we propose a simple yet effective defense for deep ensembles to break the trade-off and, consequently, improve the accuracy and privacy, simultaneously.

23.8CRAug 11, 2020

Fingerprinting the Fingerprinters: Learning to Detect Browser Fingerprinting Behaviors

Umar Iqbal, Steven Englehardt, Zubair Shafiq

Browser fingerprinting is an invasive and opaque stateless tracking technique. Browser vendors, academics, and standards bodies have long struggled to provide meaningful protections against browser fingerprinting that are both accurate and do not degrade user experience. We propose FP-Inspector, a machine learning based syntactic-semantic approach to accurately detect browser fingerprinting. We show that FP-Inspector performs well, allowing us to detect 26% more fingerprinting scripts than the state-of-the-art. We show that an API-level fingerprinting countermeasure, built upon FP-Inspector, helps reduce website breakage by a factor of 2. We use FP-Inspector to perform a measurement study of browser fingerprinting on top-100K websites. We find that browser fingerprinting is now present on more than 10% of the top-100K websites and over a quarter of the top-10K websites. We also discover previously unreported uses of JavaScript APIs by fingerprinting scripts suggesting that they are looking to exploit APIs in new and unexpected ways.

31.1CLMay 2, 2020Code

A Girl Has A Name: Detecting Authorship Obfuscation

Asad Mahmood, Zubair Shafiq, Padmini Srinivasan

Authorship attribution aims to identify the author of a text based on the stylometric analysis. Authorship obfuscation, on the other hand, aims to protect against authorship attribution by modifying a text's style. In this paper, we evaluate the stealthiness of state-of-the-art authorship obfuscation methods under an adversarial threat model. An obfuscator is stealthy to the extent an adversary finds it challenging to detect whether or not a text modified by the obfuscator is obfuscated - a decision that is key to the adversary interested in authorship attribution. We show that the existing authorship obfuscation methods are not stealthy as their obfuscated texts can be identified with an average F1 score of 0.87. The reason for the lack of stealthiness is that these obfuscators degrade text smoothness, as ascertained by neural language models, in a detectable manner. Our results highlight the need to develop stealthy authorship obfuscation methods that can better protect the identity of an author seeking anonymity.

13.6CRJan 29, 2020

A4 : Evading Learning-based Adblockers

Shitong Zhu, Zhongjie Wang, Xun Chen et al.

Efforts by online ad publishers to circumvent traditional ad blockers towards regaining fiduciary benefits, have been demonstrably successful. As a result, there have recently emerged a set of adblockers that apply machine learning instead of manually curated rules and have been shown to be more robust in blocking ads on websites including social media sites such as Facebook. Among these, AdGraph is arguably the state-of-the-art learning-based adblocker. In this paper, we develop A4, a tool that intelligently crafts adversarial samples of ads to evade AdGraph. Unlike the popular research on adversarial samples against images or videos that are considered less- to un-restricted, the samples that A4 generates preserve application semantics of the web page, or are actionable. Through several experiments we show that A4 can bypass AdGraph about 60% of the time, which surpasses the state-of-the-art attack by a significant margin of 84.3%; in addition, changes to the visual layout of the web page due to these perturbations are imperceptible. We envision the algorithmic framework proposed in A4 is also promising in improving adversarial attacks against other learning-based web applications with similar requirements.

18.1CYMay 22, 2018

AdGraph: A Graph-Based Approach to Ad and Tracker Blocking

Umar Iqbal, Peter Snyder, Shitong Zhu et al.

User demand for blocking advertising and tracking online is large and growing. Existing tools, both deployed and described in research, have proven useful, but lack either the completeness or robustness needed for a general solution. Existing detection approaches generally focus on only one aspect of advertising or tracking (e.g. URL patterns, code structure), making existing approaches susceptible to evasion. In this work we present AdGraph, a novel graph-based machine learning approach for detecting advertising and tracking resources on the web. AdGraph differs from existing approaches by building a graph representation of the HTML structure, network requests, and JavaScript behavior of a webpage, and using this unique representation to train a classifier for identifying advertising and tracking resources. Because AdGraph considers many aspects of the context a network request takes place in, it is less susceptible to the single-factor evasion techniques that flummox existing approaches. We evaluate AdGraph on the Alexa top-10K websites, and find that it is highly accurate, able to replicate the labels of human-generated filter lists with 95.33% accuracy, and can even identify many mistakes in filter lists. We implement AdGraph as a modification to Chromium. AdGraph adds only minor overhead to page loading and execution, and is actually faster than stock Chromium on 42% of websites and AdBlock Plus on 78% of websites. Overall, we conclude that AdGraph is both accurate enough and performant enough for online use, breaking comparable or fewer websites than popular filter list based approaches.