74.4SIMay 28
Scalable AI-Driven Analytics for User Engagement and Stance Detection on Social MediaThammitage Piyumi Wathsala Seneviratne, Muhammad Ikram, Dinusha Vatsalan et al.
Social media platforms have become a major vector for the large-scale dissemination of misinformation and conspiracy content, posing significant risks to public trust, health, and societal stability. While prior work has primarily focused on analysing such content from a behavioural or content-centric perspective, there is a lack of scalable, service-oriented solutions that enable continuous monitoring and analysis of user engagement at platform scale. In this paper, we present a scalable AI-driven service framework for analysing user engagement and stance on social media content. Our system integrates data ingestion, filtering, topic modelling, sentiment analysis, and stance detection into a modular pipeline that can operate on large-scale, real-world datasets. We implement and evaluate our framework on a dataset comprising over 7 million user comments collected from nearly 50,000 YouTube videos associated with conspiracy narratives. Our analysis reveals that conspiracy content attracts up to 70% of total user engagement within the first week of publication, indicating strong early amplification dynamics. Furthermore, we identify a subset of highly active users who exhibit disproportionately high engagement across multiple videos and channels. Stance analysis shows that a majority of users express favourable positions toward conspiracy narratives, highlighting the role of user communities in reinforcing such content. The proposed framework demonstrates the feasibility of deploying scalable, service-oriented analytics for real-time monitoring of user engagement and behavioural patterns. These findings demonstrate the effectiveness of our framework in capturing large-scale engagement dynamics and highlight the importance of early-stage detection and service-based monitoring for mitigating the spread of harmful content.
48.4CRApr 19Code
Original Sin of npm: A Study on Vulnerability Propagation in JavaScript Dependency NetworksMichael Robinson, Sajal Halder, Muhammad Ejaz Ahmed et al.
Understanding vulnerability propagation is essential for assessing how vulnerabilities spread across components of a software package. This supports more accurate impact analysis and enhances threat detection and mitigation. In this paper, we investigate how a small number of vulnerable JavaScript packages contribute to the creation of a disproportionately large number of vulnerable packages. This paper presents insights from 1,515 reported vulnerabilities gathered from a custom-built vulnerability database containing 1,077,946 JavaScript packages sourced from `npm-follower' and their associated dependency networks. Dependency networks were constructed using the deps.dev API, with vulnerabilities identified by parsing package names and version numbers through the Google Open Source Vulnerability API. Our findings reveal that 61.30% (660,748) of packages are reliant on one or more dependency packages, and 21.60% (232,836) of total packages have at least one known vulnerability throughout their dependency networks -- of which most (42%) are of High severity. We also found that it takes, on average, approximately 4 years and 11 months to fix a vulnerable package from when the first vulnerable version is published on npm -- although publication times of vulnerabilities occur approximately 19 days after a fix is available. Finally, we observe a high concentration of frequently present vulnerabilities throughout dependency networks, with the top-7 most frequent vulnerabilities accounting for 25% of vulnerability cases and the top-23 most frequent accounting for 50%. Based on these findings, we propose recommendations for developers and package managers to mitigate the threat and occurrence of vulnerabilities within the npm dependency network and the broader software repository community.
SIAug 24, 2023
False Information, Bots and Malicious Campaigns: Demystifying Elements of Social Media ManipulationsMohammad Majid Akhtar, Rahat Masood, Muhammad Ikram et al.
The rapid spread of false information and persistent manipulation attacks on online social networks (OSNs), often for political, ideological, or financial gain, has affected the openness of OSNs. While researchers from various disciplines have investigated different manipulation-triggering elements of OSNs (such as understanding information diffusion on OSNs or detecting automated behavior of accounts), these works have not been consolidated to present a comprehensive overview of the interconnections among these elements. Notably, user psychology, the prevalence of bots, and their tactics in relation to false information detection have been overlooked in previous research. To address this research gap, this paper synthesizes insights from various disciplines to provide a comprehensive analysis of the manipulation landscape. By integrating the primary elements of social media manipulation (SMM), including false information, bots, and malicious campaigns, we extensively examine each SMM element. Through a systematic investigation of prior research, we identify commonalities, highlight existing gaps, and extract valuable insights in the field. Our findings underscore the urgent need for interdisciplinary research to effectively combat social media manipulations, and our systematization can guide future research efforts and assist OSN providers in ensuring the safety and integrity of their platforms.
62.0CLMay 20Code
Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language ModelsSunday Oyinlola Ogundoyin, Muhammad Ikram, Rahat Masood
Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.
SISep 7, 2022
Machine Learning-based Automatic Annotation and Detection of COVID-19 Fake NewsMohammad Majid Akhtar, Bibhas Sharma, Ishan Karunanayake et al.
COVID-19 impacted every part of the world, although the misinformation about the outbreak traveled faster than the virus. Misinformation spread through online social networks (OSN) often misled people from following correct medical practices. In particular, OSN bots have been a primary source of disseminating false information and initiating cyber propaganda. Existing work neglects the presence of bots that act as a catalyst in the spread and focuses on fake news detection in 'articles shared in posts' rather than the post (textual) content. Most work on misinformation detection uses manually labeled datasets that are hard to scale for building their predictive models. In this research, we overcome this challenge of data scarcity by proposing an automated approach for labeling data using verified fact-checked statements on a Twitter dataset. In addition, we combine textual features with user-level features (such as followers count and friends count) and tweet-level features (such as number of mentions, hashtags and urls in a tweet) to act as additional indicators to detect misinformation. Moreover, we analyzed the presence of bots in tweets and show that bots change their behavior over time and are most active during the misinformation campaign. We collected 10.22 Million COVID-19 related tweets and used our annotation model to build an extensive and original ground truth dataset for classification purposes. We utilize various machine learning models to accurately detect misinformation and our best classification model achieves precision (82%), recall (96%), and false positive rate (3.58%). Also, our bot analysis indicates that bots generated approximately 10% of misinformation tweets. Our methodology results in substantial exposure of false information, thus improving the trustworthiness of information disseminated through social media platforms.
CRAug 5, 2024
On the Robustness of Malware Detectors to Adversarial SamplesMuhammad Salman, Benjamin Zi Hao Zhao, Hassan Jameel Asghar et al.
Adversarial examples add imperceptible alterations to inputs with the objective to induce misclassification in machine learning models. They have been demonstrated to pose significant challenges in domains like image classification, with results showing that an adversarially perturbed image to evade detection against one classifier is most likely transferable to other classifiers. Adversarial examples have also been studied in malware analysis. Unlike images, program binaries cannot be arbitrarily perturbed without rendering them non-functional. Due to the difficulty of crafting adversarial program binaries, there is no consensus on the transferability of adversarially perturbed programs to different detectors. In this work, we explore the robustness of malware detectors against adversarially perturbed malware. We investigate the transferability of adversarial attacks developed against one detector, against other machine learning-based malware detectors, and code similarity techniques, specifically, locality sensitive hashing-based detectors. Our analysis reveals that adversarial program binaries crafted for one detector are generally less effective against others. We also evaluate an ensemble of detectors and show that they can potentially mitigate the impact of adversarial program binaries. Finally, we demonstrate that substantial program changes made to evade detection may result in the transformation technique being identified, implying that the adversary must make minimal changes to the program binary.
CRFeb 6
Pro-ZD: A Transferable Graph Neural Network Approach for Proactive Zero-Day Threats MitigationNardine Basta, Firas Ben Hmida, Houssem Jmal et al.
In today's enterprise network landscape, the combination of perimeter and distributed firewall rules governs connectivity. To address challenges arising from increased traffic and diverse network architectures, organizations employ automated tools for firewall rule and access policy generation. Yet, effectively managing risks arising from dynamically generated policies, especially concerning critical asset exposure, remains a major challenge. This challenge is amplified by evolving network structures due to trends like remote users, bring-your-own devices, and cloud integration. This paper introduces a novel graph neural network model for identifying weighted shortest paths. The model aids in detecting network misconfigurations and high-risk connectivity paths that threaten critical assets, potentially exploited in zero-day attacks -- cyber-attacks exploiting undisclosed vulnerabilities. The proposed Pro-ZD framework adopts a proactive approach, automatically fine-tuning firewall rules and access policies to address high-risk connections and prevent unauthorized access. Experimental results highlight the robustness and transferability of Pro-ZD, achieving over 95% average accuracy in detecting high-risk connections. \
SIFeb 6, 2024
BotSSCL: Social Bot Detection with Self-Supervised Contrastive LearningMohammad Majid Akhtar, Navid Shadman Bhuiyan, Rahat Masood et al.
The detection of automated accounts, also known as "social bots", has been an increasingly important concern for online social networks (OSNs). While several methods have been proposed for detecting social bots, significant research gaps remain. First, current models exhibit limitations in detecting sophisticated bots that aim to mimic genuine OSN users. Second, these methods often rely on simplistic profile features, which are susceptible to manipulation. In addition to their vulnerability to adversarial manipulations, these models lack generalizability, resulting in subpar performance when trained on one dataset and tested on another. To address these challenges, we propose a novel framework for social Bot detection with Self-Supervised Contrastive Learning (BotSSCL). Our framework leverages contrastive learning to distinguish between social bots and humans in the embedding space to improve linear separability. The high-level representations derived by BotSSCL enhance its resilience to variations in data distribution and ensure generalizability. We evaluate BotSSCL's robustness against adversarial attempts to manipulate bot accounts to evade detection. Experiments on two datasets featuring sophisticated bots demonstrate that BotSSCL outperforms other supervised, unsupervised, and self-supervised baseline methods. We achieve approx. 6% and approx. 8% higher (F1) performance than SOTA on both datasets. In addition, BotSSCL also achieves 67% F1 when trained on one dataset and tested with another, demonstrating its generalizability. Lastly, BotSSCL increases adversarial complexity and only allows 4% success to the adversary in evading detection.
CRMay 13, 2025
A Large-Scale Empirical Analysis of Custom GPTs' Vulnerabilities in the OpenAI EcosystemSunday Oyinlola Ogundoyin, Muhammad Ikram, Hassan Jameel Asghar et al.
Millions of users leverage generative pretrained transformer (GPT)-based language models developed by leading model providers for a wide range of tasks. To support enhanced user interaction and customization, many platforms-such as OpenAI-now enable developers to create and publish tailored model instances, known as custom GPTs, via dedicated repositories or application stores. These custom GPTs empower users to browse and interact with specialized applications designed to meet specific needs. However, as custom GPTs see growing adoption, concerns regarding their security vulnerabilities have intensified. Existing research on these vulnerabilities remains largely theoretical, often lacking empirical, large-scale, and statistically rigorous assessments of associated risks. In this study, we analyze 14,904 custom GPTs to assess their susceptibility to seven exploitable threats, such as roleplay-based attacks, system prompt leakage, phishing content generation, and malicious code synthesis, across various categories and popularity tiers within the OpenAI marketplace. We introduce a multi-metric ranking system to examine the relationship between a custom GPT's popularity and its associated security risks. Our findings reveal that over 95% of custom GPTs lack adequate security protections. The most prevalent vulnerabilities include roleplay-based vulnerabilities (96.51%), system prompt leakage (92.20%), and phishing (91.22%). Furthermore, we demonstrate that OpenAI's foundational models exhibit inherent security weaknesses, which are often inherited or amplified in custom GPTs. These results highlight the urgent need for enhanced security measures and stricter content moderation to ensure the safe deployment of GPT-based applications.
CRFeb 21, 2022
Don't be a Victim During a Pandemic! Analysing Security and Privacy Threats in Twitter During COVID-19Bibhas Sharma, Ishan Karunanayake, Rahat Masood et al.
There has been a huge spike in the usage of social media platforms during the COVID-19 lockdowns. These lockdown periods have resulted in a set of new cybercrimes, thereby allowing attackers to victimise social media users with a range of threats. This paper performs a large-scale study to investigate the impact of a pandemic and the lockdown periods on the security and privacy of social media users. We analyse 10.6 Million COVID-related tweets from 533 days of data crawling and investigate users' security and privacy behaviour in three different periods (i.e., before, during, and after the lockdown). Our study shows that users unintentionally share more personal identifiable information when writing about the pandemic situation (e.g., sharing nearby coronavirus testing locations) in their tweets. The privacy risk reaches 100% if a user posts three or more sensitive tweets about the pandemic. We investigate the number of suspicious domains shared on social media during different phases of the pandemic. Our analysis reveals an increase in the number of suspicious domains during the lockdown compared to other lockdown phases. We observe that IT, Search Engines, and Businesses are the top three categories that contain suspicious domains. Our analysis reveals that adversaries' strategies to instigate malicious activities change with the country's pandemic situation.
CRJul 29, 2021
Empirical Security and Privacy Analysis of Mobile Symptom Checking Applications on Google PlayI Wayan Budi Sentana, Muhammad Ikram, Mohamed Ali Kaafar et al.
Smartphone technology has drastically improved over the past decade. These improvements have seen the creation of specialized health applications, which offer consumers a range of health-related activities such as tracking and checking symptoms of health conditions or diseases through their smartphones. We term these applications as Symptom Checking apps or simply SymptomCheckers. Due to the sensitive nature of the private data they collect, store and manage, leakage of user information could result in significant consequences. In this paper, we use a combination of techniques from both static and dynamic analysis to detect, trace and categorize security and privacy issues in 36 popular SymptomCheckers on Google Play. Our analyses reveal that SymptomCheckers request a significantly higher number of sensitive permissions and embed a higher number of third-party tracking libraries for targeted advertisements and analytics exploiting the privileged access of the SymptomCheckers in which they exist, as a mean of collecting and sharing critically sensitive data about the user and their device. We find that these are sharing the data that they collect through unencrypted plain text to the third-party advertisers and, in some cases, to malicious domains. The results reveal that the exploitation of SymptomCheckers is present in popular apps, still readily available on Google Play.
CRJul 15, 2021
BlockJack: Towards Improved Prevention of IP Prefix Hijacking Attacks in Inter-Domain Routing Via BlockchainI Wayan Budi Sentana, Muhammad Ikram, Mohamed Ali Kaafar
We propose BlockJack, a system based on a distributed and tamper-proof consortium Blockchain that aims at blocking IP prefix hijacking in the Border Gateway Protocol (BGP). In essence, BlockJack provides synchronization among BlockChain and BGP network through interfaces ensuring operational independence and this approach preserving the legacy system and accommodates the impact of a race condition if the Blockchain process exceeds the BGP update interval. BlockJack is also resilient to dynamic routing path changes during the occurrence of the IP prefix hijacking in the routing tables. We implement BlockJack using Hyperledger Fabric Blockchain and Quagga software package and we perform initial sets of experiments to evaluate its efficacy. We evaluate the performance and resilience of BlockJack in various attack scenarios including single path attacks, multiple path attacks, and attacks from random sources in the random network topology. The Evaluation results show that BlockJack is able to handle multiple attacks caused by AS paths changes during a BGP prefix hijacking. In experiment settings with 50 random routers, BlockJack takes on average 0.08 seconds (with a standard deviation of 0.04 seconds) to block BGP prefix hijacking attacks. The test result showing that BlockJack conservative approach feasible to handle the IP Prefix hijacking in the Border Gateway Protocol.
CRJun 18, 2021
Longitudinal Compliance Analysis of Android Applications with Privacy PoliciesSaad Sajid Hashmi, Nazar Waheed, Gioacchino Tangari et al.
Contemporary mobile applications (apps) are designed to track, use, and share users' data, often without their consent, which results in potential privacy and transparency issues. To investigate whether mobile apps have always been (non-)transparent regarding how they collect information about users, we perform a longitudinal analysis of the historical versions of 268 Android apps. These apps comprise 5,240 app releases or versions between 2008 and 2016. We detect inconsistencies between apps' behaviors and the stated use of data collection in privacy policies to reveal compliance issues. We utilize machine learning techniques for the classification of the privacy policy text to identify the purported practices that collect and/or share users' personal information, such as phone numbers and email addresses. We then uncover the data leaks of an app through static and dynamic analysis. Over time, our results show a steady increase in the number of apps' data collection practices that are undisclosed in the privacy policies. This behavior is particularly troubling since privacy policy is the primary tool for describing the app's privacy protection practices. We find that newer versions of the apps are likely to be more non-compliant than their preceding versions. The discrepancies between the purported and the actual data practices show that privacy policies are often incoherent with the apps' behaviors, thus defying the 'notice and choice' principle when users install apps.
CRFeb 10, 2020
Security and Privacy in IoT Using Machine Learning and Blockchain: Threats & CountermeasuresNazar Waheed, Xiangjian He, Muhammad Ikram et al.
Security and privacy of the users have become significant concerns due to the involvement of the Internet of things (IoT) devices in numerous applications. Cyber threats are growing at an explosive pace making the existing security and privacy measures inadequate. Hence, everyone on the Internet is a product for hackers. Consequently, Machine Learning (ML) algorithms are used to produce accurate outputs from large complex databases, where the generated outputs can be used to predict and detect vulnerabilities in IoT-based systems. Furthermore, Blockchain (BC) techniques are becoming popular in modern IoT applications to solve security and privacy issues. Several studies have been conducted on either ML algorithms or BC techniques. However, these studies target either security or privacy issues using ML algorithms or BC techniques, thus posing a need for a combined survey on efforts made in recent years addressing both security and privacy issues using ML algorithms and BC techniques. In this paper, we provide a summary of research efforts made in the past few years, starting from 2008 to 2019, addressing security and privacy issues using ML algorithms and BCtechniques in the IoT domain. First, we discuss and categorize various security and privacy threats reported in the past twelve years in the IoT domain. Then, we classify the literature on security and privacy efforts based on ML algorithms and BC techniques in the IoT domain. Finally, we identify and illuminate several challenges and future research directions in using ML algorithms and BC techniques to address security and privacy issues in the IoT domain.
CRJun 1, 2019
A Longitudinal Analysis of Online Ad-Blocking BlacklistsSaad Sajid Hashmi, Muhammad Ikram, Mohamed Ali Kaafar
Websites employ third-party ads and tracking services leveraging cookies and JavaScript code, to deliver ads and track users' behavior, causing privacy concerns. To limit online tracking and block advertisements, several ad-blocking (black) lists have been curated consisting of URLs and domains of well-known ads and tracking services. Using Internet Archive's Wayback Machine in this paper, we collect a retrospective view of the Web to analyze the evolution of ads and tracking services and evaluate the effectiveness of ad-blocking blacklists. We propose metrics to capture the efficacy of ad-blocking blacklists to investigate whether these blacklists have been reactive or proactive in tackling the online ad and tracking services. We introduce a stability metric to measure the temporal changes in ads and tracking domains blocked by ad-blocking blacklists, and a diversity metric to measure the ratio of new ads and tracking domains detected. We observe that ads and tracking domains in websites change over time, and among the ad-blocking blacklists that we investigated, our analysis reveals that some blacklists were more informed with the existence of ads and tracking domains, but their rate of change was slower than other blacklists. Our analysis also shows that Alexa top 5K websites in the US, Canada, and the UK have the most number of ads and tracking domains per website, and have the highest proactive scores. This suggests that ad-blocking blacklists are updated by prioritizing ads and tracking domains reported in the popular websites from these countries.
CRMay 22, 2019
DaDiDroid: An Obfuscation Resilient Tool for Detecting Android Malware via Weighted Directed Call Graph ModellingMuhammad Ikram, Pierrick Beaume, Mohamed Ali Kaafar
With the number of new mobile malware instances increasing by over 50\% annually since 2012 [24], malware embedding in mobile apps is arguably one of the most serious security issues mobile platforms are exposed to. While obfuscation techniques are successfully used to protect the intellectual property of apps' developers, they are unfortunately also often used by cybercriminals to hide malicious content inside mobile apps and to deceive malware detection tools. As a consequence, most of mobile malware detection approaches fail in differentiating between benign and obfuscated malicious apps. We examine the graph features of mobile apps code by building weighted directed graphs of the API calls, and verify that malicious apps often share structural similarities that can be used to differentiate them from benign apps, even under a heavily "polluted" training set where a large majority of the apps are obfuscated. We present DaDiDroid an Android malware app detection tool that leverages features of the weighted directed graphs of API calls to detect the presence of malware code in (obfuscated) Android apps. We show that DaDiDroid significantly outperforms MaMaDroid [23], a recently proposed malware detection tool that has been proven very efficient in detecting malware in a clean non-obfuscated environment. We evaluate DaDiDroid's accuracy and robustness against several evasion techniques using various datasets for a total of 43,262 benign and 20,431 malware apps. We show that DaDiDroid correctly labels up to 96% of Android malware samples, while achieving an 91% accuracy with an exclusive use of a training set of obfuscated apps.
CRApr 24, 2019
A Decade of Mal-Activity Reporting: A Retrospective Analysis of Internet Malicious Activity BlacklistsBenjamin Zi Hao Zhao, Muhammad Ikram, Hassan Jameel Asghar et al.
This paper focuses on reporting of Internet malicious activity (or mal-activity in short) by public blacklists with the objective of providing a systematic characterization of what has been reported over the years, and more importantly, the evolution of reported activities. Using an initial seed of 22 blacklists, covering the period from January 2007 to June 2017, we collect more than 51 million mal-activity reports involving 662K unique IP addresses worldwide. Leveraging the Wayback Machine, antivirus (AV) tool reports and several additional public datasets (e.g., BGP Route Views and Internet registries) we enrich the data with historical meta-information including geo-locations (countries), autonomous system (AS) numbers and types of mal-activity. Furthermore, we use the initially labelled dataset of approx 1.57 million mal-activities (obtained from public blacklists) to train a machine learning classifier to classify the remaining unlabeled dataset of approx 44 million mal-activities obtained through additional sources. We make our unique collected dataset (and scripts used) publicly available for further research. The main contributions of the paper are a novel means of report collection, with a machine learning approach to classify reported activities, characterization of the dataset and, most importantly, temporal analysis of mal-activity reporting behavior. Inspired by P2P behavior modeling, our analysis shows that some classes of mal-activities (e.g., phishing) and a small number of mal-activity sources are persistent, suggesting that either blacklist-based prevention systems are ineffective or have unreasonably long update periods. Our analysis also indicates that resources can be better utilized by focusing on heavy mal-activity contributors, which constitute the bulk of mal-activities.
CRJan 23, 2019
The Chain of Implicit Trust: An Analysis of the Web Third-party Resources LoadingMuhammad Ikram, Rahat Masood, Gareth Tyson et al.
The Web is a tangled mass of interconnected services, where websites import a range of external resources from various third-party domains. However, the latter can further load resources hosted on other domains. For each website, this creates a dependency chain underpinned by a form of implicit trust between the first-party and transitively connected third-parties. The chain can only be loosely controlled as first-party websites often have little, if any, visibility of where these resources are loaded from. This paper performs a large-scale study of dependency chains in the Web, to find that around 50% of first-party websites render content that they did not directly load. Although the majority (84.91%) of websites have short dependency chains (below 3 levels), we find websites with dependency chains exceeding 30. Using VirusTotal, we show that 1.2% of these third-parties are classified as suspicious --- although seemingly small, this limited set of suspicious third-parties have remarkable reach into the wider ecosystem. By running sandboxed experiments, we observe a range of activities with the majority of suspicious JavaScript downloading malware; worryingly, we find this propensity is greater among implicitly trusted JavaScripts.
CRSep 9, 2017
A First Look at Ad Blocking Apps on Google PlayMuhammad Ikram, Mohamed Ali Kaafar
Online advertisers and analytics services (or trackers), are constantly tracking users activities as they access web services either through browsers or a mobile apps. Numerous tools such as browser plugins and specialized mobile apps have been proposed to limit intrusive advertisements and prevent tracking on desktop computing and mobile phones. For desktop computing, browser plugins are heavily studied for their usability and efficiency issues, however, tools that block ads and prevent trackers in mobile platforms, have received the least or no attention. In this paper, we present a first look at 97 Android adblocking apps (or adblockers), extracted from more than 1.5 million apps from Google Play, that promise to block advertisements and analytics services. With our data collection and analysis pipeline of the Android adblockers, we reveal the presences of third-party tracking libraries and sensitive permissions for critical resources on user mobile devices as well as have malware in the source codes. We analyze users' reviews for the in-effectiveness of adblockers in terms of not blocking ads and trackers. We found that a significant fraction of adblockers are not fulfilling their advertised functionality.
CRJul 1, 2017
Measuring, Characterizing, and Detecting Facebook Like FarmsMuhammad Ikram, Lucky Onwuzurike, Shehroze Farooqi et al.
Social networks offer convenient ways to seamlessly reach out to large audiences. In particular, Facebook pages are increasingly used by businesses, brands, and organizations to connect with multitudes of users worldwide. As the number of likes of a page has become a de-facto measure of its popularity and profitability, an underground market of services artificially inflating page likes, aka like farms, has emerged alongside Facebook's official targeted advertising platform. Nonetheless, there is little work that systematically analyzes Facebook pages' promotion methods. Aiming to fill this gap, we present a honeypot-based comparative measurement study of page likes garnered via Facebook advertising and from popular like farms. First, we analyze likes based on demographic, temporal, and social characteristics, and find that some farms seem to be operated by bots and do not really try to hide the nature of their operations, while others follow a stealthier approach, mimicking regular users' behavior. Next, we look at fraud detection algorithms currently deployed by Facebook and show that they do not work well to detect stealthy farms which spread likes over longer timespans and like popular pages to mimic regular users. To overcome their limitations, we investigate the feasibility of timeline-based detection of like farm accounts, focusing on characterizing content generated by Facebook accounts on their timelines as an indicator of genuine versus fake social activity. We analyze a range of features, grouped into two main categories: lexical and non-lexical. We find that like farm accounts tend to re-share content, use fewer words and poorer vocabulary, and more often generate duplicate comments and likes compared to normal users. Using relevant lexical and non-lexical features, we build a classifier to detect like farms accounts that achieves precision higher than 99% and 93% recall.
CRMar 20, 2016
Towards Seamless Tracking-Free Web: Improved Detection of Trackers via One-class LearningMuhammad Ikram, Hassan Jameel Asghar, Mohamed Ali Kaafar et al.
Numerous tools have been developed to aggressively block the execution of popular JavaScript programs (JS) in Web browsers. Such blocking also affects functionality of webpages and impairs user experience. As a consequence, many privacy preserving tools (PP-Tools) that have been developed to limit online tracking, often executed via JS, may suffer from poor performance and limited uptake. A mechanism that can isolate JS necessary for proper functioning of the website from tracking JS would thus be useful. Through the use of a manually labelled dataset composed of 2,612 JS, we show how current PP-Tools are ineffective in finding the right balance between blocking tracking JS and allowing functional JS. To the best of our knowledge, this is the first study to assess the performance of current web PP-Tools. To improve this balance, we examine the two classes of JS and hypothesize that tracking JS share structural similarities that can be used to differentiate them from functional JS. The rationale of our approach is that web developers often borrow and customize existing pieces of code in order to embed tracking (resp. functional) JS into their webpages. We then propose one-class machine learning classifiers using syntactic and semantic features extracted from JS. When trained only on samples of tracking JS, our classifiers achieve an accuracy of 99%, where the best of the PP-Tools achieved an accuracy of 78%. We further test our classifiers and several popular PP-Tools on a corpus of 4K websites with 135K JS. The output of our best classifier on this data is between 20 to 64% different from the PP-Tools. We manually analyse a sample of the JS for which our classifier is in disagreement with all other PP-Tools, and show that our approach is not only able to enhance user web experience by correctly classifying more functional JS, but also discovers previously unknown tracking services.
SIJun 1, 2015
Combating Fraud in Online Social Networks: Detecting Stealthy Facebook Like FarmsMuhammad Ikram, Lucky Onwuzurike, Shehroze Farooqi et al.
As businesses increasingly rely on social networking sites to engage with their customers, it is crucial to understand and counter reputation manipulation activities, including fraudulently boosting the number of Facebook page likes using like farms. To this end, several fraud detection algorithms have been proposed and some deployed by Facebook that use graph co-clustering to distinguish between genuine likes and those generated by farm-controlled profiles. However, as we show in this paper, these tools do not work well with stealthy farms whose users spread likes over longer timespans and like popular pages, aiming to mimic regular users. We present an empirical analysis of the graph-based detection tools used by Facebook and highlight their shortcomings against more sophisticated farms. Next, we focus on characterizing content generated by social networks accounts on their timelines, as an indicator of genuine versus fake social activity. We analyze a wide range of features extracted from timeline posts, which we group into two main classes: lexical and non-lexical. We postulate and verify that like farm accounts tend to often re-share content, use fewer words and poorer vocabulary, and more often generate duplicate comments and likes compared to normal users. We extract relevant lexical and non-lexical features and and use them to build a classifier to detect like farms accounts, achieving significantly higher accuracy, namely, at least 99% precision and 93% recall.
CYMay 7, 2015
Characterizing Key Stakeholders in an Online Black-Hat MarketplaceShehroze Farooqi, Muhammad Ikram, Emiliano De Cristofaro et al.
Over the past few years, many black-hat marketplaces have emerged that facilitate access to reputation manipulation services such as fake Facebook likes, fraudulent search engine optimization (SEO), or bogus Amazon reviews. In order to deploy effective technical and legal countermeasures, it is important to understand how these black-hat marketplaces operate, shedding light on the services they offer, who is selling, who is buying, what are they buying, who is more successful, why are they successful, etc. Toward this goal, in this paper, we present a detailed micro-economic analysis of a popular online black-hat marketplace, namely, SEOClerks.com. As the site provides non-anonymized transaction information, we set to analyze selling and buying behavior of individual users, propose a strategy to identify key users, and study their tactics as compared to other (non-key) users. We find that key users: (1) are mostly located in Asian countries, (2) are focused more on selling black-hat SEO services, (3) tend to list more lower priced services, and (4) sometimes buy services from other sellers and then sell at higher prices. Finally, we discuss the implications of our analysis with respect to devising effective economic and legal intervention strategies against marketplace operators and key users.