SI May 28
Scalable AI-Driven Analytics for User Engagement and Stance Detection on Social Media
Thammitage Piyumi Wathsala Seneviratne, Muhammad Ikram, Dinusha Vatsalan et al.
Social media platforms have become a major vector for the large-scale dissemination of misinformation and conspiracy content, posing significant risks to public trust, health, and societal stability. While prior work has primarily focused on analysing such content from a behavioural or content-centric perspective, there is a lack of scalable, service-oriented solutions that enable continuous monitoring and analysis of user engagement at platform scale. In this paper, we present a scalable AI-driven service framework for analysing user engagement and stance on social media content. Our system integrates data ingestion, filtering, topic modelling, sentiment analysis, and stance detection into a modular pipeline that can operate on large-scale, real-world datasets. We implement and evaluate our framework on a dataset comprising over 7 million user comments collected from nearly 50,000 YouTube videos associated with conspiracy narratives. Our analysis reveals that conspiracy content attracts up to 70% of total user engagement within the first week of publication, indicating strong early amplification dynamics. Furthermore, we identify a subset of highly active users who exhibit disproportionately high engagement across multiple videos and channels. Stance analysis shows that a majority of users express favourable positions toward conspiracy narratives, highlighting the role of user communities in reinforcing such content. The proposed framework demonstrates the feasibility of deploying scalable, service-oriented analytics for real-time monitoring of user engagement and behavioural patterns. These findings demonstrate the effectiveness of our framework in capturing large-scale engagement dynamics and highlight the importance of early-stage detection and service-based monitoring for mitigating the spread of harmful content.
CR Apr 3, 2022 [Code]
A Differentially Private Framework for Deep Learning with Convexified Loss Functions
Zhigang Lu, Hassan Jameel Asghar, Mohamed Ali Kaafar et al.
Differential privacy (DP) has been applied in deep learning for preserving privacy of the underlying training sets. Existing DP practice falls into three categories - objective perturbation, gradient perturbation and output perturbation. They suffer from three main problems. First, conditions on objective functions limit objective perturbation in general deep learning tasks. Second, gradient perturbation does not achieve a satisfactory privacy-utility trade-off due to over-injected noise in each epoch. Third, high utility of the output perturbation method is not guaranteed because of the loose upper bound on the global sensitivity of the trained model parameters as the noise scale parameter. To address these problems, we analyse a tighter upper bound on the global sensitivity of the model parameters. Under a black-box setting, based on this global sensitivity, to control the overall noise injection, we propose a novel output perturbation framework by injecting DP noise into a randomly sampled neuron (via the exponential mechanism) at the output layer of a baseline non-private neural network trained with a convexified loss function. We empirically compare the privacy-utility trade-off, measured by accuracy loss to baseline non-private models and the privacy leakage against black-box membership inference (MI) attacks, between our framework and the open-source differentially private stochastic gradient descent (DP-SGD) approaches on six commonly used real-world datasets. The experimental evaluations show that, when the baseline models have observable privacy leakage under MI attacks, our framework achieves a better privacy-utility trade-off than existing DP-SGD implementations, given an overall privacy budget $ε\leq 1$ for a large number of queries.
CR Jan 9, 2023
Privacy-Preserving Record Linkage for Cardinality Counting
Nan Wu, Dinusha Vatsalan, Mohamed Ali Kaafar et al.
Several applications require counting the number of distinct items in the data, which is known as the cardinality counting problem. Example applications include health applications such as counting rare disease patients for adequate awareness and funding, and counting the number of cases of a new disease for outbreak detection, marketing applications such as counting the visibility reached for a new product, and cybersecurity applications such as tracking the number of unique views of social media posts. The data needed for the counting is however often personal and sensitive, and needs to be processed using privacy-preserving techniques. The quality of data in different databases, for example typos, errors and variations, poses additional challenges for accurate cardinality estimation. While privacy-preserving cardinality counting has gained much attention in recent times and a few privacy-preserving algorithms have been developed for cardinality estimation, no work has so far been done on privacy-preserving cardinality counting using record linkage techniques with fuzzy matching and provable privacy guarantees. We propose a novel privacy-preserving record linkage algorithm using unsupervised clustering techniques to link and count the cardinality of individuals in multiple datasets without compromising their privacy or identity. In addition, existing Elbow methods to find the optimal number of clusters as the cardinality are far from accurate as they do not take into account the purity and completeness of generated clusters. We propose a novel method to find the optimal number of clusters in unsupervised learning. Our experimental results on real and synthetic datasets are highly promising, with a significantly smaller error rate of less than 0.1 at a privacy budget ε = 1.0 compared to the state-of-the-art fuzzy matching and clustering method.
CL Apr 6, 2023
Those Aren't Your Memories, They're Somebody Else's: Seeding Misinformation in Chat Bot Memories
Conor Atkins, Benjamin Zi Hao Zhao, Hassan Jameel Asghar et al.
One of the new developments in chit-chat bots is a long-term memory mechanism that remembers information from past conversations for increasing engagement and consistency of responses. The bot is designed to extract knowledge of a personal nature from its conversation partner, e.g., stating preference for a particular color. In this paper, we show that this memory mechanism can result in unintended behavior. In particular, we found that one can combine a personal statement with an informative statement that would lead the bot to remember the informative statement alongside personal knowledge in its long-term memory. This means that the bot can be tricked into remembering misinformation which it would regurgitate as statements of fact when recalling information relevant to the topic of conversation. We demonstrate this vulnerability on the BlenderBot 2 framework implemented on the ParlAI platform and provide examples on the more recent and significantly larger BlenderBot 3 model. We generate 150 examples of misinformation, of which 114 (76%) were remembered by BlenderBot 2 when combined with a personal statement. We further assessed the risk of this misinformation being recalled after intervening innocuous conversation and in response to multiple questions relevant to the injected memory. Our evaluation was performed on both the memory-only and the combination of memory and internet search modes of BlenderBot 2. From the combinations of these variables, we generated 12,890 conversations and analyzed recalled misinformation in the responses. We found that when the chat bot is questioned on the misinformation topic, it was 328% more likely to respond with the misinformation as fact when the misinformation was in the long-term memory.
CR Aug 5, 2024
On the Robustness of Malware Detectors to Adversarial Samples
Muhammad Salman, Benjamin Zi Hao Zhao, Hassan Jameel Asghar et al.
Adversarial examples add imperceptible alterations to inputs with the objective to induce misclassification in machine learning models. They have been demonstrated to pose significant challenges in domains like image classification, with results showing that an adversarially perturbed image to evade detection against one classifier is most likely transferable to other classifiers. Adversarial examples have also been studied in malware analysis. Unlike images, program binaries cannot be arbitrarily perturbed without rendering them non-functional. Due to the difficulty of crafting adversarial program binaries, there is no consensus on the transferability of adversarially perturbed programs to different detectors. In this work, we explore the robustness of malware detectors against adversarially perturbed malware. We investigate the transferability of adversarial attacks developed against one detector, against other machine learning-based malware detectors, and code similarity techniques, specifically, locality sensitive hashing-based detectors. Our analysis reveals that adversarial program binaries crafted for one detector are generally less effective against others. We also evaluate an ensemble of detectors and show that they can potentially mitigate the impact of adversarial program binaries. Finally, we demonstrate that substantial program changes made to evade detection may result in the transformation technique being identified, implying that the adversary must make minimal changes to the program binary.
CR Feb 6
Pro-ZD: A Transferable Graph Neural Network Approach for Proactive Zero-Day Threats Mitigation
Nardine Basta, Firas Ben Hmida, Houssem Jmal et al.
In today's enterprise network landscape, the combination of perimeter and distributed firewall rules governs connectivity. To address challenges arising from increased traffic and diverse network architectures, organizations employ automated tools for firewall rule and access policy generation. Yet, effectively managing risks arising from dynamically generated policies, especially concerning critical asset exposure, remains a major challenge. This challenge is amplified by evolving network structures due to trends like remote users, bring-your-own devices, and cloud integration. This paper introduces a novel graph neural network model for identifying weighted shortest paths. The model aids in detecting network misconfigurations and high-risk connectivity paths that threaten critical assets, potentially exploited in zero-day attacks -- cyber-attacks exploiting undisclosed vulnerabilities. The proposed Pro-ZD framework adopts a proactive approach, automatically fine-tuning firewall rules and access policies to address high-risk connections and prevent unauthorized access. Experimental results highlight the robustness and transferability of Pro-ZD, achieving over 95% average accuracy in detecting high-risk connections.
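The weighted-shortest-path task mentioned in this abstract can be illustrated with a classical baseline. The sketch below uses Dijkstra's algorithm, not the paper's graph neural network; the toy topology, node names, and edge weights are hypothetical and only show what a low-cost route to a critical asset looks like.

```python
import heapq

def shortest_path(graph, source, target):
    """Return (cost, path) of the minimum-weight path, or (inf, []) if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for nbr, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# Hypothetical topology: each weight stands in for the cost of traversing a link.
network = {
    "internet": {"dmz": 1.0, "vpn": 4.0},
    "dmz": {"app": 2.0},
    "vpn": {"app": 1.0, "db": 5.0},
    "app": {"db": 1.0},
}
cost, path = shortest_path(network, "internet", "db")
print(cost, path)
```

In this toy network the cheapest route to the hypothetical critical asset "db" goes through "dmz" and "app", which is the kind of path a policy review would want to flag.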
AI Oct 24, 2024
Can Self Supervision Rejuvenate Similarity-Based Link Prediction?
Chenhan Zhang, Weiqi Wang, Zhiyi Tian et al.
Although recent advancements in end-to-end learning-based link prediction (LP) methods have shown remarkable capabilities, the significance of traditional similarity-based LP methods persists in unsupervised scenarios where there are no known link labels. However, the selection of node features for similarity computation in similarity-based LP can be challenging. Less informative node features can result in suboptimal LP performance. To address these challenges, we integrate self-supervised graph learning techniques into similarity-based LP and propose a novel method: Self-Supervised Similarity-based LP (3SLP). 3SLP is suitable for the unsupervised condition of similarity-based LP without the assistance of known link labels. Specifically, 3SLP introduces a dual-view contrastive node representation learning (DCNRL) with crafted data augmentation and node representation learning. DCNRL is dedicated to developing more informative node representations, replacing the node attributes as inputs in the similarity-based LP backbone. Extensive experiments over benchmark datasets demonstrate the salient improvement of 3SLP, outperforming the baseline of traditional similarity-based LP by up to 21.2% (AUC).
CL Jun 26, 2024
ConvoCache: Smart Re-Use of Chatbot Responses
Conor Atkins, Ian Wood, Mohamed Ali Kaafar et al.
We present ConvoCache, a conversational caching system that solves the problem of slow and expensive generative AI models in spoken chatbots. ConvoCache finds a semantically similar prompt in the past and reuses the response. In this paper we evaluate ConvoCache on the DailyDialog dataset. We find that ConvoCache can apply a UniEval coherence threshold of 90% and respond to 89% of prompts using the cache with an average latency of 214ms, replacing LLM and voice synthesis that can take over 1s. To further reduce latency we test prefetching and find limited usefulness. Prefetching with 80% of a request leads to a 63% hit rate, and a drop in overall coherence. ConvoCache can be used with any chatbot to reduce costs by reducing usage of generative AI by up to 89%.
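The core idea of reusing a response when a semantically similar prompt was seen before can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words embedding stands in for a learned sentence encoder, the class name and the 0.8 similarity threshold are hypothetical, and the UniEval coherence check described in the abstract is omitted.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ResponseCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # hypothetical similarity cut-off
        self.entries = []           # list of (embedding, response)

    def add(self, prompt, response):
        self.entries.append((embed(prompt), response))

    def lookup(self, prompt):
        """Return the response of the most similar cached prompt, or None on a miss."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(query, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

cache = ResponseCache(threshold=0.8)
cache.add("how are you today", "I'm doing well, thanks!")
hit = cache.lookup("how are you today friend")        # similar enough: reuse
miss = cache.lookup("what is the weather in Sydney")  # no overlap: fall back to the model
print(hit, miss)
```

On a miss, the caller would invoke the generative model and store the new prompt and response with `add`, so the cache grows as conversations accumulate.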
CR Jul 29, 2021
Empirical Security and Privacy Analysis of Mobile Symptom Checking Applications on Google Play
I Wayan Budi Sentana, Muhammad Ikram, Mohamed Ali Kaafar et al.
Smartphone technology has drastically improved over the past decade. These improvements have seen the creation of specialized health applications, which offer consumers a range of health-related activities such as tracking and checking symptoms of health conditions or diseases through their smartphones. We term these applications Symptom Checking apps or simply SymptomCheckers. Due to the sensitive nature of the private data they collect, store and manage, leakage of user information could result in significant consequences. In this paper, we use a combination of techniques from both static and dynamic analysis to detect, trace and categorize security and privacy issues in 36 popular SymptomCheckers on Google Play. Our analyses reveal that SymptomCheckers request a significantly higher number of sensitive permissions and embed a higher number of third-party tracking libraries for targeted advertisements and analytics, which exploit the privileged access of the SymptomCheckers in which they exist as a means of collecting and sharing critically sensitive data about the user and their device. We find that these are sharing the data that they collect in unencrypted plain text with third-party advertisers and, in some cases, with malicious domains. The results reveal that the exploitation of SymptomCheckers is present in popular apps that are still readily available on Google Play.
CR Jul 15, 2021
BlockJack: Towards Improved Prevention of IP Prefix Hijacking Attacks in Inter-Domain Routing Via Blockchain
I Wayan Budi Sentana, Muhammad Ikram, Mohamed Ali Kaafar
We propose BlockJack, a system based on a distributed and tamper-proof consortium Blockchain that aims at blocking IP prefix hijacking in the Border Gateway Protocol (BGP). In essence, BlockJack provides synchronization between the Blockchain and BGP networks through interfaces that ensure operational independence; this approach preserves the legacy system and accommodates the impact of a race condition if the Blockchain process exceeds the BGP update interval. BlockJack is also resilient to dynamic routing path changes during the occurrence of IP prefix hijacking in the routing tables. We implement BlockJack using Hyperledger Fabric Blockchain and the Quagga software package, and we perform initial sets of experiments to evaluate its efficacy. We evaluate the performance and resilience of BlockJack in various attack scenarios including single path attacks, multiple path attacks, and attacks from random sources in a random network topology. The evaluation results show that BlockJack is able to handle multiple attacks caused by AS path changes during a BGP prefix hijacking. In experiment settings with 50 random routers, BlockJack takes on average 0.08 seconds (with a standard deviation of 0.04 seconds) to block BGP prefix hijacking attacks. The test results show that BlockJack's conservative approach is feasible for handling IP prefix hijacking in the Border Gateway Protocol.
LG Mar 12, 2021
On the (In)Feasibility of Attribute Inference Attacks on Machine Learning Models
Benjamin Zi Hao Zhao, Aviral Agrawal, Catisha Coburn et al.
With an increase in low-cost machine learning APIs, advanced machine learning models may be trained on private datasets and monetized by providing them as a service. However, privacy researchers have demonstrated that these models may leak information about records in the training dataset via membership inference attacks. In this paper, we take a closer look at another inference attack reported in literature, called attribute inference, whereby an attacker tries to infer missing attributes of a partially known record used in the training dataset by accessing the machine learning model as an API. We show that even if a classification model succumbs to membership inference attacks, it is unlikely to be susceptible to attribute inference attacks. We demonstrate that this is because membership inference attacks fail to distinguish a member from a nearby non-member. We call the ability of an attacker to distinguish the two (similar) vectors as strong membership inference. We show that membership inference attacks cannot infer membership in this strong setting, and hence inferring attributes is infeasible. However, under a relaxed notion of attribute inference, called approximate attribute inference, we show that it is possible to infer attributes close to the true attributes. We verify our results on three publicly available datasets, five membership, and three attribute inference attacks reported in literature.
CR Feb 3, 2021
All Infections are Not Created Equal: Time-Sensitive Prediction of Malware Generated Network Attacks
Zainab Abaid, Dilip Sarkar, Mohamed Ali Kaafar et al.
Many techniques have been proposed for quickly detecting and containing malware-generated network attacks such as large-scale denial of service attacks; unfortunately, much damage is already done within the first few minutes of an attack, before it is identified and contained. There is a need for an early warning system that can predict attacks before they actually manifest, so that upcoming attacks can be prevented altogether by blocking the hosts that are likely to engage in attacks. However, blocking responses may disrupt legitimate processes on blocked hosts; in order to minimise user inconvenience, it is important to also foretell the time when the predicted attacks will occur, so that only the most urgent threats result in auto-blocking responses, while less urgent ones are first manually investigated. To this end, we identify a typical infection sequence followed by modern malware; modelling this sequence as a Markov chain and training it on real malicious traffic, we are able to identify behaviour most likely to lead to attacks and predict 98% of real-world spamming and port-scanning attacks before they occur. Moreover, using a Semi-Markov chain model, we are able to foretell the time of upcoming attacks, a novel capability that allows accurately predicting the times of 97% of real-world malware attacks. Our work represents an important and timely step towards enabling flexible threat response models that minimise disruption to legitimate users.
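Modelling an infection sequence as a Markov chain amounts to estimating how often one observed behaviour is followed by another. The sketch below is an illustration under stated assumptions, not the paper's model: the state names and traces are hypothetical, the chain is first-order, and timing (the Semi-Markov part) is not modelled.

```python
from collections import defaultdict

def fit_transitions(sequences):
    """Estimate first-order transition probabilities by counting adjacent state pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for cur, nxts in counts.items()
    }

def path_probability(P, path):
    """Probability of following the given state path under transition table P."""
    prob = 1.0
    for cur, nxt in zip(path, path[1:]):
        prob *= P.get(cur, {}).get(nxt, 0.0)
    return prob

# Hypothetical host-behaviour traces; "attack" is the state we want to anticipate.
traces = [
    ["scan", "download", "c2", "attack"],
    ["scan", "download", "c2", "attack"],
    ["scan", "download", "idle"],
    ["scan", "idle"],
]
P = fit_transitions(traces)
p_attack = path_probability(P, ["scan", "download", "c2", "attack"])
print(P["scan"], p_attack)
```

A host whose observed behaviour sits on a high-probability path towards the attack state would be a candidate for blocking or manual investigation.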
CR Aug 20, 2020
Not one but many Tradeoffs: Privacy Vs. Utility in Differentially Private Machine Learning
Benjamin Zi Hao Zhao, Mohamed Ali Kaafar, Nicolas Kourtellis
Data holders are increasingly seeking to protect their users' privacy, whilst still maximizing their ability to produce machine learning models with high quality predictions. In this work, we empirically evaluate various implementations of differential privacy (DP), and measure their ability to fend off real-world privacy attacks, in addition to measuring their core goal of providing accurate classifications. We establish an evaluation framework to ensure each of these implementations is fairly evaluated. Our selection of DP implementations adds DP noise at different positions within the framework, either at the point of data collection/release, during updates while training the model, or after training by perturbing learned model parameters. We evaluate each implementation across a range of privacy budgets and datasets, each implementation providing the same mathematical privacy guarantees. By measuring the models' resistance to real-world attacks of membership and attribute inference, and their classification accuracy, we determine which implementations provide the most desirable tradeoff between privacy and utility. We found that the number of classes of a given dataset is unlikely to influence where the privacy and utility tradeoff occurs. Additionally, in the scenario that high privacy constraints are required, perturbing input training data does not trade off as much utility, as compared to noise added later in the ML process.
CR Jul 22, 2020
Exploiting Behavioral Side-Channels in Observation Resilient Cognitive Authentication Schemes
Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Mohamed Ali Kaafar et al.
Observation Resilient Authentication Schemes (ORAS) are a class of shared secret challenge-response identification schemes where a user mentally computes the response via a cognitive function to authenticate herself such that eavesdroppers cannot readily extract the secret. Security evaluation of ORAS generally involves quantifying information leaked via observed challenge-response pairs. However, little work has evaluated information leaked via human behavior while interacting with these schemes. A common way to achieve observation resilience is by including a modulus operation in the cognitive function. This minimizes the information leaked about the secret due to the many-to-one map from the set of possible secrets to a given response. In this work, we show that user behavior can be used as a side-channel to obtain the secret in such ORAS. Specifically, the user's eye-movement patterns and associated timing information can be used to deduce whether a modulus operation was performed (a fundamental design element), leaking information about the secret. We further show that the secret can still be retrieved if the deduction is erroneous, a more likely case in practice. We treat the vulnerability analytically, and propose a generic attack algorithm that iteratively obtains the secret despite the "faulty" modulus information. We demonstrate the attack on five ORAS, and show that the secret can be retrieved with considerably fewer challenge-response pairs than non-side-channel attacks (e.g., algebraic/statistical attacks). In particular, our attack is applicable to Mod10, a one-time-pad based scheme, for which no non-side-channel attack exists. We field test our attack with a small-scale eye-tracking user study.
LG Mar 18, 2020
The Cost of Privacy in Asynchronous Differentially-Private Machine Learning
Farhad Farokhi, Nan Wu, David Smith et al.
We consider training machine learning models using training data located on multiple private and geographically-scattered servers with different privacy settings. Due to the distributed nature of the data, communicating with all collaborating private data owners simultaneously may prove challenging or altogether impossible. In this paper, we develop differentially-private asynchronous algorithms for collaboratively training machine-learning models on multiple private datasets. The asynchronous nature of the algorithms implies that a central learner interacts with the private data owners one-on-one whenever they are available for communication without needing to aggregate query responses to construct gradients of the entire fitness function. Therefore, the algorithm efficiently scales to many data owners. We define the cost of privacy as the difference between the fitness of a privacy-preserving machine-learning model and the fitness of a machine-learning model trained in the absence of privacy concerns. We prove that we can forecast the performance of the proposed privacy-preserving asynchronous algorithms. We demonstrate that the cost of privacy has an upper bound that is inversely proportional to the combined size of the training datasets squared and the sum of the privacy budgets squared. We validate the theoretical results with experiments on financial and medical datasets. The experiments illustrate that collaboration among more than 10 data owners with at least 10,000 records with privacy budgets greater than or equal to 1 results in a superior machine-learning model in comparison to a model trained in isolation on only one of the datasets, illustrating the value of collaboration and the cost of privacy. The number of collaborating datasets can be lowered if the privacy budget is higher.
LG Jan 29, 2020
Modelling and Quantifying Membership Information Leakage in Machine Learning
Farhad Farokhi, Mohamed Ali Kaafar
Machine learning models have been shown to be vulnerable to membership inference attacks, i.e., inferring whether individuals' data have been used for training models. The lack of understanding about factors contributing to the success of these attacks motivates the need for modelling membership information leakage using information theory and for investigating properties of machine learning models and training algorithms that can reduce membership information leakage. We use conditional mutual information leakage to measure the amount of information leakage from the trained machine learning model about the presence of an individual in the training dataset. We devise an upper bound for this measure of information leakage using Kullback--Leibler divergence that is more amenable to numerical computation. We prove a direct relationship between the Kullback--Leibler membership information leakage and the probability of success for a hypothesis-testing adversary examining whether a particular data record belongs to the training dataset of a machine learning model. We show that the mutual information leakage is a decreasing function of the training dataset size and the regularization weight. We also prove that, if the sensitivity of the machine learning model (defined in terms of the derivatives of the fitness with respect to model parameters) is high, more membership information is potentially leaked. This illustrates that complex models, such as deep neural networks, are more susceptible to membership inference attacks in comparison to simpler models with fewer degrees of freedom. We show that the amount of the membership information leakage is reduced by $\mathcal{O}(\log^{1/2}(δ^{-1})ε^{-1})$ when using Gaussian $(ε,δ)$-differentially-private additive noise.
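The $\log^{1/2}(δ^{-1})ε^{-1}$ dependence in the last sentence has the same form as the noise scale of the classical Gaussian mechanism. The sketch below shows that textbook calibration, $σ = Δ\sqrt{2\ln(1.25/δ)}/ε$ for $0 < ε < 1$, as an illustration only; it is not claimed to be the calibration used in the paper, and the function names are hypothetical.

```python
import math
import random

def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise standard deviation for the classical (epsilon, delta) Gaussian mechanism."""
    if not (0 < epsilon < 1):
        raise ValueError("the classical calibration assumes 0 < epsilon < 1")
    if not (0 < delta < 1):
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

def privatize(value, sensitivity, epsilon, delta, rng=random):
    """Release value plus zero-mean Gaussian noise with the calibrated scale."""
    return value + rng.gauss(0.0, gaussian_sigma(sensitivity, epsilon, delta))

sigma = gaussian_sigma(sensitivity=1.0, epsilon=0.5, delta=1e-5)
noisy = privatize(10.0, sensitivity=1.0, epsilon=0.5, delta=1e-5)
print(sigma, noisy)
```

Halving ε doubles the noise scale, while shrinking δ increases it only through the square root of a logarithm.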
CR Jan 13, 2020
On the Resilience of Biometric Authentication Systems against Random Inputs
Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Mohamed Ali Kaafar
We assess the security of machine learning based biometric authentication systems against an attacker who submits uniform random inputs, either as feature vectors or raw inputs, in order to find an accepting sample of a target user. The average false positive rate (FPR) of the system, i.e., the rate at which an impostor is incorrectly accepted as the legitimate user, may be interpreted as a measure of the success probability of such an attack. However, we show that the success rate is often higher than the FPR. In particular, for one reconstructed biometric system with an average FPR of 0.03, the success rate was as high as 0.78. This has implications for the security of the system, as an attacker with only the knowledge of the length of the feature space can impersonate the user with fewer than 2 attempts on average. We provide detailed analysis of why the attack is successful, and validate our results using four different biometric modalities and four different machine learning classifiers. Finally, we propose mitigation techniques that render such attacks ineffective, with little to no effect on the accuracy of the system.
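The "fewer than 2 attempts" figure follows from the reported success rate if each random submission is treated as an independent trial with the same acceptance probability (an assumption made here for illustration). The number of attempts is then geometrically distributed with mean 1/p, as the short sketch below shows; the function name is hypothetical.

```python
def expected_attempts(success_rate):
    """Mean of a geometric distribution: expected trials until the first success."""
    if not (0 < success_rate <= 1):
        raise ValueError("success_rate must lie in (0, 1]")
    return 1.0 / success_rate

observed = expected_attempts(0.78)   # success rate reported in the abstract
if_fpr = expected_attempts(0.03)     # what the average FPR alone would suggest
print(round(observed, 2), round(if_fpr, 1))
```

Under this assumption the attacker needs about 1.28 attempts on average, compared with roughly 33 if the success probability were equal to the average FPR.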
CR Aug 28, 2019
On Inferring Training Data Attributes in Machine Learning Models
Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Raghav Bhaskar et al.
A number of recent works have demonstrated that API access to machine learning models leaks information about the dataset records used to train the models. Further, the work of \cite{somesh-overfit} shows that such membership inference attacks (MIAs) may be sufficient to construct a stronger breed of attribute inference attacks (AIAs), which given a partial view of a record can guess the missing attributes. In this work, we show (to the contrary) that MIA may not be sufficient to build a successful AIA. This is because the latter requires the ability to distinguish between similar records (differing only in a few attributes), and, as we demonstrate, the current breed of MIA are unsuccessful in distinguishing member records from similar non-member records. We thus propose a relaxed notion of AIA, whose goal is to only approximately guess the missing attributes and argue that such an attack is more likely to be successful, if MIA is to be used as a subroutine for inferring training record attributes.
CR Jun 24, 2019
The Value of Collaboration in Convex Machine Learning with Differential Privacy
Nan Wu, Farhad Farokhi, David Smith et al.
In this paper, we apply machine learning to distributed private data owned by multiple data owners, entities with access to non-overlapping training datasets. We use noisy, differentially-private gradients to minimize the fitness cost of the machine learning model using stochastic gradient descent. We quantify the quality of the trained model, using the fitness cost, as a function of privacy budget and size of the distributed datasets to capture the trade-off between privacy and utility in machine learning. This way, we can predict the outcome of collaboration among privacy-aware data owners prior to executing potentially computationally-expensive machine learning algorithms. In particular, we show that the difference between the fitness of the machine learning model trained using differentially-private gradient queries and the fitness of the machine learning model trained in the absence of any privacy concerns is inversely proportional to the size of the training datasets squared and the privacy budget squared. We successfully validate the performance prediction against the actual performance of the proposed privacy-aware learning algorithms, applied to financial datasets for determining interest rates of loans using regression and for detecting credit card fraud using support vector machines.
CR Jun 1, 2019
A Longitudinal Analysis of Online Ad-Blocking Blacklists
Saad Sajid Hashmi, Muhammad Ikram, Mohamed Ali Kaafar
Websites employ third-party ads and tracking services, leveraging cookies and JavaScript code, to deliver ads and track users' behavior, causing privacy concerns. To limit online tracking and block advertisements, several ad-blocking (black) lists have been curated consisting of URLs and domains of well-known ads and tracking services. In this paper, we use the Internet Archive's Wayback Machine to collect a retrospective view of the Web, analyze the evolution of ads and tracking services, and evaluate the effectiveness of ad-blocking blacklists. We propose metrics to capture the efficacy of ad-blocking blacklists to investigate whether these blacklists have been reactive or proactive in tackling online ad and tracking services. We introduce a stability metric to measure the temporal changes in ads and tracking domains blocked by ad-blocking blacklists, and a diversity metric to measure the ratio of new ads and tracking domains detected. We observe that ads and tracking domains in websites change over time, and among the ad-blocking blacklists that we investigated, our analysis reveals that some blacklists were better informed about the existence of ads and tracking domains, but their rate of change was slower than other blacklists. Our analysis also shows that Alexa top 5K websites in the US, Canada, and the UK have the largest number of ads and tracking domains per website, and have the highest proactive scores. This suggests that ad-blocking blacklists are updated by prioritizing ads and tracking domains reported in the popular websites from these countries.
CR May 22, 2019
DaDiDroid: An Obfuscation Resilient Tool for Detecting Android Malware via Weighted Directed Call Graph Modelling
Muhammad Ikram, Pierrick Beaume, Mohamed Ali Kaafar
With the number of new mobile malware instances increasing by over 50% annually since 2012 [24], malware embedding in mobile apps is arguably one of the most serious security issues mobile platforms are exposed to. While obfuscation techniques are successfully used to protect the intellectual property of app developers, they are unfortunately also often used by cybercriminals to hide malicious content inside mobile apps and to deceive malware detection tools. As a consequence, most mobile malware detection approaches fail to differentiate between benign and obfuscated malicious apps. We examine the graph features of mobile app code by building weighted directed graphs of the API calls, and verify that malicious apps often share structural similarities that can be used to differentiate them from benign apps, even under a heavily "polluted" training set where a large majority of the apps are obfuscated. We present DaDiDroid, an Android malware detection tool that leverages features of the weighted directed graphs of API calls to detect the presence of malware code in (obfuscated) Android apps. We show that DaDiDroid significantly outperforms MaMaDroid [23], a recently proposed malware detection tool that has been proven very efficient in detecting malware in a clean non-obfuscated environment. We evaluate DaDiDroid's accuracy and robustness against several evasion techniques using various datasets for a total of 43,262 benign and 20,431 malware apps. We show that DaDiDroid correctly labels up to 96% of Android malware samples, while achieving 91% accuracy when trained exclusively on obfuscated apps.
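A weighted directed graph of API calls can be sketched in a few lines. The example below is only illustrative: the function names, the input format (a list of caller/callee pairs) and the chosen features are assumptions, not DaDiDroid's actual feature set, which the abstract does not enumerate.

```python
from collections import Counter


def build_call_graph(call_pairs):
    """Return {(caller, callee): weight}; the weight counts how often the call occurs."""
    return Counter(call_pairs)


def graph_features(graph):
    """Compute a few simple structural features (hypothetical choices)."""
    nodes = {name for edge in graph for name in edge}
    out_weight = Counter()
    for (caller, _callee), weight in graph.items():
        out_weight[caller] += weight
    return {
        "num_nodes": len(nodes),
        "num_edges": len(graph),
        "total_weight": sum(graph.values()),
        "max_out_weight": max(out_weight.values(), default=0),
    }


if __name__ == "__main__":
    calls = [("app.Main", "net.Http"), ("app.Main", "net.Http"),
             ("app.Main", "io.File"), ("net.Http", "io.File")]
    print(graph_features(build_call_graph(calls)))
```

Features of this kind depend on how calls connect rather than on identifier names, which is one plausible reason graph structure can survive renaming-style obfuscation.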
CR Apr 24, 2019
A Decade of Mal-Activity Reporting: A Retrospective Analysis of Internet Malicious Activity Blacklists
Benjamin Zi Hao Zhao, Muhammad Ikram, Hassan Jameel Asghar et al.
This paper focuses on the reporting of Internet malicious activity (or mal-activity for short) by public blacklists, with the objective of providing a systematic characterization of what has been reported over the years and, more importantly, the evolution of reported activities. Using an initial seed of 22 blacklists, covering the period from January 2007 to June 2017, we collect more than 51 million mal-activity reports involving 662K unique IP addresses worldwide. Leveraging the Wayback Machine, antivirus (AV) tool reports and several additional public datasets (e.g., BGP Route Views and Internet registries), we enrich the data with historical meta-information including geo-locations (countries), autonomous system (AS) numbers and types of mal-activity. Furthermore, we use the initially labelled dataset of approximately 1.57 million mal-activities (obtained from public blacklists) to train a machine learning classifier to classify the remaining unlabelled dataset of approximately 44 million mal-activities obtained through additional sources. We make our unique collected dataset (and scripts used) publicly available for further research. The main contributions of the paper are a novel means of report collection, with a machine learning approach to classify reported activities, characterization of the dataset and, most importantly, temporal analysis of mal-activity reporting behavior. Inspired by P2P behavior modeling, our analysis shows that some classes of mal-activities (e.g., phishing) and a small number of mal-activity sources are persistent, suggesting that blacklist-based prevention systems are either ineffective or have unreasonably long update periods. Our analysis also indicates that resources can be better utilized by focusing on heavy mal-activity contributors, which constitute the bulk of mal-activities.
CR Feb 4, 2019
Differentially Private Release of High-Dimensional Datasets using the Gaussian Copula
Hassan Jameel Asghar, Ming Ding, Thierry Rakotoarivelo et al.
We propose a generic mechanism to efficiently release differentially private synthetic versions of high-dimensional datasets with high utility. The core technique in our mechanism is the use of copulas. Specifically, we use the Gaussian copula to define dependencies of attributes in the input dataset, whose rows are modelled as samples from an unknown multivariate distribution, and then sample synthetic records through this copula. Despite the inherently numerical nature of Gaussian correlations, we construct a method that is applicable to both numerical and categorical attributes. Our mechanism is efficient in that it only takes time proportional to the square of the number of attributes in the dataset. We propose a differentially private way of constructing the Gaussian copula without compromising computational efficiency. Through experiments on three real-world datasets, we show that we can obtain highly accurate answers to the set of all one-way marginal, and two- and three-way positive conjunction queries, with 99% of the query answers having absolute (fractional) error rates between 0.01 and 3%. Furthermore, for a majority of two-way and three-way queries, we outperform independent noise addition through the well-known Laplace mechanism. In terms of computational time, we demonstrate that our mechanism can output synthetic datasets in around 6 minutes 47 seconds on average for an input dataset of about 200 binary attributes and more than 32,000 rows, and in about 2 hours 30 minutes for a much larger dataset of about 700 binary attributes and more than 5 million rows. To further demonstrate scalability, we ran the mechanism on larger (artificial) datasets with 1,000 and 2,000 binary attributes (and 5 million rows), obtaining synthetic outputs in approximately 6 and 19 hours, respectively.
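The copula idea can be illustrated with a minimal, non-private sketch for two attributes. Everything here is an assumption made for illustration: the function names are hypothetical, ties are broken arbitrarily when ranking, and the differentially private estimation of the correlation, which is the paper's actual contribution, is deliberately omitted.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for score and CDF transforms


def normal_scores(column):
    """Map each value to a standard-normal score via its rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        scores[idx] = _STD.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def empirical_quantile(sorted_column, u):
    """Inverse empirical CDF: pick the value at quantile u in [0, 1]."""
    idx = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[idx]


def sample_copula(col_a, col_b, n_samples, seed=0):
    """Sample synthetic (a, b) pairs through a two-dimensional Gaussian copula."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    residual = max(0.0, 1.0 - rho * rho) ** 0.5  # guard against rounding above 1
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + residual * rng.gauss(0.0, 1.0)  # correlated normal pair
        samples.append((empirical_quantile(sorted_a, _STD.cdf(z1)),
                        empirical_quantile(sorted_b, _STD.cdf(z2))))
    return samples
```

Because each attribute is mapped back through its own empirical quantiles, the synthetic values always come from the original value set, while the dependence between attributes is carried only by the correlation `rho`.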
CR Jan 23, 2019
The Chain of Implicit Trust: An Analysis of the Web Third-party Resources Loading
Muhammad Ikram, Rahat Masood, Gareth Tyson et al.
The Web is a tangled mass of interconnected services, where websites import a range of external resources from various third-party domains. However, the latter can further load resources hosted on other domains. For each website, this creates a dependency chain underpinned by a form of implicit trust between the first-party and transitively connected third-parties. The chain can only be loosely controlled, as first-party websites often have little, if any, visibility of where these resources are loaded from. This paper performs a large-scale study of dependency chains in the Web and finds that around 50% of first-party websites render content that they did not directly load. Although the majority (84.91%) of websites have short dependency chains (below 3 levels), we find websites with dependency chains exceeding 30 levels. Using VirusTotal, we show that 1.2% of these third-parties are classified as suspicious; although seemingly small, this limited set of suspicious third-parties has remarkable reach into the wider ecosystem. By running sandboxed experiments, we observe a range of activities, with the majority of suspicious JavaScript downloading malware; worryingly, we find this propensity is greater among implicitly trusted JavaScripts.
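The notion of a dependency chain can be sketched as a traversal over "who loads what". The code below is a toy illustration under stated assumptions: the `loads` mapping, the example domains and both function names are hypothetical, and a level is counted as one hop away from the first-party site.

```python
from collections import deque


def chain_depth(loads, root):
    """Length of the longest load chain below `root`, counted in levels."""
    best = 0
    stack = [(root, 0, frozenset([root]))]
    while stack:
        node, depth, seen = stack.pop()
        best = max(best, depth)
        for child in loads.get(node, ()):
            if child not in seen:  # avoid looping on cyclic loads
                stack.append((child, depth + 1, seen | {child}))
    return best


def implicitly_trusted(loads, root):
    """Resources reachable from `root` that the first party did not load directly."""
    direct = set(loads.get(root, ()))
    reachable, queue = set(), deque([root])
    while queue:
        for child in loads.get(queue.popleft(), ()):
            if child not in reachable and child != root:
                reachable.add(child)
                queue.append(child)
    return reachable - direct


if __name__ == "__main__":
    loads = {
        "site.example": ["cdn.example", "ads.example"],
        "ads.example": ["tracker.example"],
        "tracker.example": ["sync.example"],
    }
    print(chain_depth(loads, "site.example"))
    print(sorted(implicitly_trusted(loads, "site.example")))
```

In this toy graph the first party explicitly trusts only two domains, yet two further domains end up serving content to its visitors.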
CR Sep 19, 2018
Gwardar: Towards Protecting a Software-Defined Network from Malicious Network Operating Systems
Arash Shaghaghi, Salil S. Kanhere, Mohamed Ali Kaafar et al.
A Software-Defined Network (SDN) controller (a.k.a. Network Operating System or NOS) is regarded as the brain of the network and is the single most critical element responsible for managing an SDN. Complementary to existing solutions that aim to protect a NOS, we propose an intrusion protection system designed to protect an SDN against a controller that has been successfully compromised. Gwardar maintains a virtual replica of the data plane by intercepting the OpenFlow messages exchanged between the control and data planes. By observing the long-term flow of packets, Gwardar learns the normal set of trajectories in the data plane for distinct packet headers. Upon detecting an unexpected packet trajectory, it starts by verifying the data plane forwarding devices, comparing the actual packet trajectories with the expected ones computed over the virtual replica. If the anomalous trajectories match the NOS instructions, Gwardar inspects the NOS itself. For this, it submits policies matching the normal set of trajectories and verifies whether the controller submits matching flow rules to the data plane and whether the network view provided to the application plane reflects the changes. Our evaluation results demonstrate the practicality of Gwardar, with high detection accuracy in a reasonable time-frame.
CR Jul 7, 2018
Gargoyle: A Network-based Insider Attack Resilient Framework for Organizations
Arash Shaghaghi, Salil S. Kanhere, Mohamed Ali Kaafar et al.
The `Anytime, Anywhere' data access model has become a widespread IT policy in organizations, making insider attacks even more complicated to model, predict and deter. Here, we propose Gargoyle, a network-based insider attack resilient framework against the most complex insider threats within a pervasive computing context. Compared to existing solutions, Gargoyle evaluates the trustworthiness of an access request context through a new set of contextual attributes called Network Context Attributes (NCAs). NCAs are extracted from the network traffic and include information such as the user's device capabilities, security level, current and prior interactions with other devices, network connection status, and suspicious online activities. Retrieving such information from the user's device and its integrated sensors is challenging in terms of device performance overheads, sensor costs, availability, reliability and trustworthiness. To address these issues, Gargoyle leverages the capabilities of Software-Defined Networking (SDN) for both policy enforcement and implementation. In fact, Gargoyle's SDN App can interact with the network controller to create a `defence-in-depth' protection system. For instance, Gargoyle can automatically quarantine a suspicious data requestor in the enterprise network for further investigation, or filter out an access request before engaging a data provider. Finally, instead of employing simplistic binary rules in access authorizations, Gargoyle incorporates Function-based Access Control (FBAC) and supports the customization of access policies into a set of functions (e.g., disabling copy, allowing print) depending on the perceived trustworthiness of the context.
ML Jun 6, 2018
Not All Attributes are Created Equal: $d_{\mathcal{X}}$-Private Mechanisms for Linear Queries
Parameswaran Kamalaruban, Victor Perrier, Hassan Jameel Asghar et al.
Differential privacy provides strong privacy guarantees while simultaneously enabling useful insights from sensitive datasets. However, it provides the same level of protection for all elements (individuals and attributes) in the data. There are practical scenarios where some data attributes need more or less protection than others. In this paper, we consider $d_{\mathcal{X}}$-privacy, an instantiation of the privacy notion introduced in \cite{chatzikokolakis2013broadening}, which allows this flexibility by specifying a separate privacy budget for each pair of elements in the data domain. We describe a systematic procedure to tailor any existing differentially private mechanism that assumes a query set and a sensitivity vector as input into its $d_{\mathcal{X}}$-private variant, specifically focusing on linear queries. Our proposed meta-procedure has broad applications, as linear queries form the basis of a range of data analysis and machine learning algorithms, and the ability to define a more flexible privacy budget across the data domain results in an improved privacy/utility trade-off in these applications. We propose several $d_{\mathcal{X}}$-private mechanisms, and provide theoretical guarantees on the trade-off between utility and privacy. We also experimentally demonstrate the effectiveness of our procedure by evaluating our proposed $d_{\mathcal{X}}$-private Laplace mechanism on both synthetic and real datasets using a set of randomly generated linear queries.
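As background for the mechanisms discussed above, the sketch below shows the standard (uniform-budget) Laplace mechanism answering one linear query over a histogram. It is the baseline that such work starts from, not the $d_{\mathcal{X}}$-private variant proposed in the paper; the function names are hypothetical, and the sensitivity assumes neighbouring datasets differ by adding or removing one record.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_linear_query(histogram, weights, epsilon, rng=None):
    """Answer sum(w_i * x_i) with epsilon-differentially-private Laplace noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_answer = sum(w * x for w, x in zip(weights, histogram))
    # One record changes a single bin by 1, so the answer moves by at most max|w_i|.
    sensitivity = max(abs(w) for w in weights)
    if sensitivity == 0:
        return float(true_answer)  # the query ignores the data entirely
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    hist = [120, 45, 80]      # counts per category
    query = [1.0, 0.0, 0.5]   # linear query weights
    print(private_linear_query(hist, query, epsilon=0.5, rng=random.Random(7)))
```

Here a single `epsilon` protects every element equally; the flexibility described in the abstract comes from replacing that single budget with one that varies across pairs of domain elements.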
CR Apr 1, 2018
Software-Defined Network (SDN) Data Plane Security: Issues, Solutions and Future Directions
Arash Shaghaghi, Mohamed Ali Kaafar, Rajkumar Buyya et al.
Software-Defined Networking (SDN) radically changes the network architecture by decoupling the network logic from the underlying forwarding devices. This architectural change rejuvenates the network layer, granting centralized management and re-programmability of networks. From a security perspective, SDN separates security concerns into the control and data planes, and this architectural recomposition brings up exciting opportunities and challenges. The overall perception is that SDN capabilities will ultimately result in improved security. However, in its raw form, SDN could potentially make networks more vulnerable to attacks and harder to protect. In this paper, we focus on identifying challenges faced in securing the data plane of SDN, one of the least explored but most critical components of this technology. We formalize this problem space, identify potential attack scenarios while highlighting possible vulnerabilities, and establish a set of requirements and challenges to protect the data plane of SDNs. Moreover, we undertake a survey of existing solutions with respect to the identified threats, identify their limitations, and offer future research directions.
CR Sep 9, 2017
A First Look at Ad Blocking Apps on Google Play
Muhammad Ikram, Mohamed Ali Kaafar
Online advertisers and analytics services (or trackers) are constantly tracking users' activities as they access web services through either browsers or mobile apps. Numerous tools, such as browser plugins and specialized mobile apps, have been proposed to limit intrusive advertisements and prevent tracking on desktop computers and mobile phones. For desktop computing, browser plugins have been heavily studied for their usability and efficiency issues; however, tools that block ads and prevent trackers on mobile platforms have received little or no attention. In this paper, we present a first look at 97 Android adblocking apps (or adblockers), extracted from more than 1.5 million apps from Google Play, that promise to block advertisements and analytics services. With our data collection and analysis pipeline for Android adblockers, we reveal the presence of third-party tracking libraries and sensitive permissions for critical resources on users' mobile devices, as well as malware in the source code. We analyze users' reviews for evidence of the ineffectiveness of adblockers, in terms of not blocking ads and trackers. We find that a significant fraction of adblockers do not fulfil their advertised functionality.
CR Aug 18, 2017
WedgeTail: An Intrusion Prevention System for the Data Plane of Software Defined Networks
Arash Shaghaghi, Mohamed Ali Kaafar, Sanjay Jha
Networks are vulnerable to disruptions caused by malicious forwarding devices. The situation is likely to worsen in Software Defined Networks (SDNs) with the incompatibility of existing solutions, the use of programmable soft switches and the potential of bringing down an entire network through compromised forwarding devices. In this paper, we present WedgeTail, an Intrusion Prevention System (IPS) designed to secure the SDN data plane. WedgeTail regards forwarding devices as points within a geometric space and stores the paths packets take when traversing the network as trajectories. To be efficient, it prioritizes forwarding devices before inspection using an unsupervised trajectory-based sampling mechanism. For each forwarding device, WedgeTail computes the expected and actual trajectories of packets and `hunts' for any forwarding device not processing packets as expected. Compared to related work, WedgeTail is also capable of distinguishing between malicious actions such as packet drop and generation. Moreover, WedgeTail employs a radically different methodology that enables detecting threats autonomously. In fact, it does not rely on rules pre-defined by an administrator and may be easily adapted to protect SDNs with different setups, forwarding devices, and controllers. We have evaluated WedgeTail in simulated environments, and it has been capable of detecting and responding to all implanted malicious forwarding devices within a reasonable time-frame. We report on the design, implementation, and evaluation of WedgeTail in this manuscript.
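Comparing expected and actual packet trajectories can be illustrated with a very small sketch. The representation (a trajectory as an ordered list of device names), the labels returned and the function name are all assumptions made for illustration; WedgeTail's geometric model and sampling mechanism are not reproduced here.

```python
def classify_trajectory(expected, actual):
    """Compare the expected and observed paths of one packet.

    Returns one of the hypothetical labels "ok", "drop", "generation", "deviation".
    """
    if actual == expected:
        return "ok"
    if not expected:
        # A packet was observed although no path was expected for it.
        return "generation"
    if len(actual) < len(expected) and expected[:len(actual)] == actual:
        # The packet followed the expected path and then vanished.
        return "drop"
    return "deviation"


if __name__ == "__main__":
    expected = ["s1", "s2", "s3"]
    print(classify_trajectory(expected, ["s1", "s2", "s3"]))
    print(classify_trajectory(expected, ["s1", "s2"]))
    print(classify_trajectory([], ["s4", "s5"]))
    print(classify_trajectory(expected, ["s1", "s4", "s3"]))
```

In the "drop" case, the last device on the observed path is a natural first suspect for inspection.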
NI Aug 14, 2017
uStash: a Novel Mobile Content Delivery System for Improving User QoE in Public Transport
Fang-Zhou Jiang, Kanchana Thilakarathna, Sirine Mrabet et al.
Mobile data traffic is growing exponentially, and it is even more challenging to distribute content efficiently while users are "on the move", such as in public transport. The use of mobile devices for accessing content (e.g., videos) while commuting is both expensive and unreliable, although it is becoming common practice worldwide. Leveraging the spatial and temporal correlation of content popularity and users' diverse network connectivity, we propose a novel content distribution system, uStash, which guarantees better QoE with regard to access delays and cost of usage. The proposed collaborative download and content stashing schemes give the uStash provider the flexibility to control the cost of content access via cellular networks. We model the uStash system in a probabilistic framework and thereby analytically derive the optimal portions for collaborative downloading. Then, we validate the proposed models using real-life trace-driven simulations. In particular, we use datasets from 22 inter-city buses running on 6 different routes and from a mobile VoD service provider to show that uStash reduces the cost of monthly cellular data by approximately 50% and the expected delay for content access by 60% compared to content downloaded via users' cellular network connections.
IR Jul 5, 2017
Graph Based Recommendations: From Data Representation to Feature Extraction and Application
Amit Tiroshi, Tsvi Kuflik, Shlomo Berkovsky et al.
Modeling users for the purpose of identifying their preferences and then personalizing services on the basis of these models is a complex task, primarily due to the need to take into consideration various explicit and implicit signals, missing or uncertain information, contextual aspects, and more. In this study, a novel generic approach for uncovering latent preference patterns from user data is proposed and evaluated. The approach relies on representing the data using graphs, and then systematically extracting graph-based features and using them to enrich the original user models. The extracted features encapsulate complex relationships between users, items, and metadata. The enhanced user models can then serve as an input to any recommendation algorithm. The proposed approach is domain-independent (demonstrated on data from movies, music, and business recommender systems), and is evaluated using several state-of-the-art machine learning methods, on different recommendation tasks, and using different evaluation metrics. The results show a unanimous improvement in the recommendation accuracy across tasks and domains. In addition, the evaluation provides a deeper analysis regarding the performance of the approach in special scenarios, including high sparsity and variability of ratings.
CR Jul 5, 2017
More Flexible Differential Privacy: The Application of Piecewise Mixture Distributions in Query Release
David B. Smith, Kanchana Thilakarathna, Mohamed Ali Kaafar
There is an increasing demand to make data "open" to third parties, as data sharing has great benefits in data-driven decision making. However, with a wide variety of sensitive data collected, protecting the privacy of individuals, communities and organizations is an essential factor in making data "open". The approaches currently adopted by industry in releasing private data are often ad hoc and prone to a number of attacks, including re-identification attacks, as they do not provide adequate privacy guarantees. While differential privacy has attracted significant interest from academia and industry by providing rigorous and reliable privacy guarantees, the reduced utility and inflexibility of current differentially private algorithms for data release are a barrier to their use in real life. This paper aims to address these two challenges. First, we propose a novel mechanism to augment the conventional utility of differential privacy by fusing two Laplace or geometric distributions together. We derive closed-form expressions for entropy, variance of added noise, and absolute expectation of noise for the proposed piecewise mixtures. The relevant distributions are then used to theoretically prove the privacy and accuracy guarantees of the proposed mechanisms. Second, we show that our proposed mechanisms have greater flexibility, with three parameters to adjust, giving better utility in bounding noise and mitigating larger inaccuracy in comparison to typical one-parameter differentially private mechanisms. We then empirically evaluate the performance of piecewise mixture distributions with extensive simulations and with a real-world dataset for both linear count queries and histogram queries. The empirical results show an increase in all utility measures considered, while maintaining privacy, for the piecewise mixture mechanisms compared to standard Laplace or geometric mechanisms.
CR May 24, 2017
On the Privacy of the Opal Data Release: A Response
Hassan Jameel Asghar, Paul Tyler, Mohamed Ali Kaafar
This document is a response to a report from the University of Melbourne on the privacy of the Opal dataset release. The Opal dataset was released by Data61 (CSIRO) in conjunction with Transport for New South Wales (TfNSW). The data consists of two separate weeks of "tap-on/tap-off" data of individuals who used any of the four different modes of public transport from TfNSW: buses, light rail, train and ferries. These taps are recorded through the smart ticketing system, known as Opal, available in the state of New South Wales, Australia.
CR May 16, 2017
Differentially Private Release of Public Transport Data: The Opal Use Case
Hassan Jameel Asghar, Paul Tyler, Mohamed Ali Kaafar
This document describes the application of a differentially private algorithm to release public transport usage data from Transport for New South Wales (TfNSW), Australia. The data consists of two separate weeks of "tap-on/tap-off" data of individuals who used any of the four different modes of public transport from TfNSW: buses, light rail, train and ferries. These taps are recorded through the smart ticketing system, known as Opal, available in the state of New South Wales, Australia.
CR Oct 28, 2016
BehavioCog: An Observation Resistant Authentication Scheme
Jagmohan Chauhan, Benjamin Zi Hao Zhao, Hassan Jameel Asghar et al.
We propose that by integrating behavioural biometric gestures---such as drawing figures on a touch screen---with challenge-response based cognitive authentication schemes, we can benefit from the properties of both. On the one hand, we can improve the usability of existing cognitive schemes by significantly reducing the number of challenge-response rounds by (partially) relying on the hardness of mimicking carefully designed behavioural biometric gestures. On the other hand, the observation resistant property of cognitive schemes provides an extra layer of protection for behavioural biometrics; an attacker is unsure if a failed impersonation is due to a biometric failure or a wrong response to the challenge. We design and develop an instantiation of such a "hybrid" scheme, and call it BehavioCog. To provide security close to a 4-digit PIN---one in 10,000 chance to impersonate---we only need two challenge-response rounds, which can be completed in less than 38 seconds on average (as estimated in our user study), with the advantage that unlike PINs or passwords, the scheme is secure under observation.
CR May 12, 2016
SplitBox: Toward Efficient Private Network Function Virtualization
Hassan Jameel Asghar, Luca Melis, Cyril Soldani et al.
This paper presents SplitBox, a scalable system for privately processing network functions that are outsourced as software processes to the cloud. Specifically, providers processing the network functions do not learn the network policies instructing how the functions are to be processed. We first propose an abstract model of a generic network function based on match-action pairs, assuming that this is processed in a distributed manner by multiple honest-but-curious providers. Then, we introduce our SplitBox system for private network function virtualization and present a proof-of-concept implementation on FastClick -- an extension of the Click modular router -- using a firewall as a use case. Our experimental results show that SplitBox achieves a throughput of over 2 Gbps with 1 kB-sized packets on average, traversing up to 60 firewall rules.
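A network function modelled as match-action pairs can be sketched as an ordered rule list in which the first matching rule decides the action. The example below is a plain, non-private illustration: the `Rule` type, the packet fields and the default action are assumptions, and none of SplitBox's policy-hiding machinery is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Packet = Dict[str, object]  # e.g. {"src": "10.0.0.5", "dst_port": 22}


@dataclass
class Rule:
    match: Callable[[Packet], bool]  # predicate over packet header fields
    action: str                      # e.g. "allow" or "drop"


def process(packet: Packet, rules: List[Rule], default: str = "drop") -> str:
    """Return the action of the first rule whose match predicate accepts the packet."""
    for rule in rules:
        if rule.match(packet):
            return rule.action
    return default


if __name__ == "__main__":
    firewall = [
        Rule(lambda p: p.get("dst_port") == 22, "drop"),
        Rule(lambda p: str(p.get("src", "")).startswith("10."), "allow"),
    ]
    print(process({"src": "10.0.0.5", "dst_port": 80}, firewall))
    print(process({"src": "10.0.0.5", "dst_port": 22}, firewall))
```

In a private setting the predicates and actions are exactly what the providers should not learn, which is why a scheme such as the one described processes them in a distributed, hidden form rather than in the clear as here.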
CR Mar 20, 2016
Towards Seamless Tracking-Free Web: Improved Detection of Trackers via One-class Learning
Muhammad Ikram, Hassan Jameel Asghar, Mohamed Ali Kaafar et al.
Numerous tools have been developed to aggressively block the execution of popular JavaScript programs (JS) in Web browsers. Such blocking also affects the functionality of webpages and impairs user experience. As a consequence, many privacy preserving tools (PP-Tools) that have been developed to limit online tracking, often executed via JS, may suffer from poor performance and limited uptake. A mechanism that can isolate JS necessary for the proper functioning of a website from tracking JS would thus be useful. Using a manually labelled dataset composed of 2,612 JS, we show that current PP-Tools are ineffective in finding the right balance between blocking tracking JS and allowing functional JS. To the best of our knowledge, this is the first study to assess the performance of current web PP-Tools. To improve this balance, we examine the two classes of JS and hypothesize that tracking JS share structural similarities that can be used to differentiate them from functional JS. The rationale of our approach is that web developers often borrow and customize existing pieces of code in order to embed tracking (resp. functional) JS into their webpages. We then propose one-class machine learning classifiers using syntactic and semantic features extracted from JS. When trained only on samples of tracking JS, our classifiers achieve an accuracy of 99%, whereas the best of the PP-Tools achieved an accuracy of 78%. We further test our classifiers and several popular PP-Tools on a corpus of 4K websites with 135K JS. The output of our best classifier on this data differs from that of the PP-Tools by between 20% and 64%. We manually analyse a sample of the JS for which our classifier is in disagreement with all other PP-Tools, and show that our approach is not only able to enhance user web experience by correctly classifying more functional JS, but also discovers previously unknown tracking services.
CR Jan 25, 2016
Private Processing of Outsourced Network Functions: Feasibility and Constructions
Luca Melis, Hassan Jameel Asghar, Emiliano De Cristofaro et al.
Aiming to reduce the cost and complexity of maintaining networking infrastructures, organizations are increasingly outsourcing their network functions (e.g., firewalls, traffic shapers and intrusion detection systems) to the cloud, and a number of industrial players have started to offer network function virtualization (NFV)-based solutions. Alas, outsourcing network functions in its current setting implies that sensitive network policies, such as firewall rules, are revealed to the cloud provider. In this paper, we investigate the use of cryptographic primitives for processing outsourced network functions, so that the provider does not learn any sensitive information. More specifically, we present a cryptographic treatment of privacy-preserving outsourcing of network functions, introducing security definitions as well as an abstract model of generic network functions, and then propose a few instantiations using partial homomorphic encryption and public-key encryption with keyword search. We include a proof-of-concept implementation of our constructions and show that network functions can be privately processed by an untrusted cloud provider in a few milliseconds.
CR Nov 2, 2015
TLS in the wild: an Internet-wide analysis of TLS-based protocols for electronic communication
Ralph Holz, Johanna Amann, Olivier Mehani et al.
The majority of electronic communication today happens either via email or chat. Thanks to the use of standardised protocols, electronic mail (SMTP, IMAP, POP3) and instant chat (XMPP, IRC) servers can be deployed in a decentralised but interoperable fashion. These protocols can be secured by providing encryption with the use of TLS (directly or via the STARTTLS extension) and leverage X.509 PKIs or ad hoc methods to authenticate communication peers. However, many combinations of these mechanisms lead to insecure deployments. We present the largest study to date that investigates the security of the email and chat infrastructures. We used active Internet-wide scans to determine the number of secure service deployments, and passive monitoring to investigate whether user agents actually use this opportunity to secure their communications. We addressed both the client-to-server interactions and the server-to-server forwarding mechanisms that these protocols offer, as well as the use of encryption and authentication methods in the process. Our findings shed light on a so far unexplored area of the Internet. The truly frightening result is that most of our communication is poorly secured in transit.
HC Jul 7, 2015
The Web for Under-Powered Mobile Devices: Lessons learned from Google Glass
Jagmohan Chauhan, Mohamed Ali Kaafar, Anirban Mahanti
This paper examines some of the potential challenges associated with enabling a seamless web experience on underpowered mobile devices such as Google Glass, from the perspective of web content providers, the device, and the network. We conducted experiments to study the impact of webpage complexity, individual web components and different application layer protocols on the performance of the Glass browser while accessing webpages, by measuring webpage load time, temperature variation and power consumption and comparing them with those of a smartphone. Our findings suggest that (a) the performance of Glass compared to a smartphone in terms of power consumption and webpage load time deteriorates with increasing webpage complexity, (b) execution time for popular JavaScript benchmarks is about 3-8 times higher on Glass compared to a smartphone, (c) WebP is a more energy-efficient image format than JPEG and PNG, and (d) seven out of 50 websites studied are optimized for content delivery to Glass.
CY May 7, 2015
Characterizing Key Stakeholders in an Online Black-Hat Marketplace
Shehroze Farooqi, Muhammad Ikram, Emiliano De Cristofaro et al.
Over the past few years, many black-hat marketplaces have emerged that facilitate access to reputation manipulation services such as fake Facebook likes, fraudulent search engine optimization (SEO), or bogus Amazon reviews. In order to deploy effective technical and legal countermeasures, it is important to understand how these black-hat marketplaces operate, shedding light on the services they offer, who is selling, who is buying, what they are buying, who is more successful, why they are successful, etc. Toward this goal, in this paper, we present a detailed micro-economic analysis of a popular online black-hat marketplace, namely, SEOClerks.com. As the site provides non-anonymized transaction information, we set out to analyze the selling and buying behavior of individual users, propose a strategy to identify key users, and study their tactics as compared to other (non-key) users. We find that key users: (1) are mostly located in Asian countries, (2) are focused more on selling black-hat SEO services, (3) tend to list more lower-priced services, and (4) sometimes buy services from other sellers and then sell them at higher prices. Finally, we discuss the implications of our analysis with respect to devising effective economic and legal intervention strategies against marketplace operators and key users.
CR Dec 9, 2014
Gesture-based Continuous Authentication for Wearable Devices: the Google Glass Case
Jagmohan Chauhan, Hassan Jameel Asghar, Mohamed Ali Kaafar et al.
We study the feasibility of touch gesture behavioural biometrics for implicit authentication of users on a smartglass (Google Glass) by proposing a continuous authentication system using two classifiers: SVM with RBF kernel, and a new classifier based on Chebyshev's concentration inequality. Based on data collected from 30 volunteers, we show that such authentication is feasible both in terms of classification accuracy and computational load on smartglasses. We achieve a classification accuracy of up to 99% with only 75 training samples using behavioural biometric data from four different types of touch gestures. To show that our system can be generalized, we test its performance on touch data from smartphones and find the accuracy to be similar to that on smartglasses. Finally, our experiments on the permanence of gestures show that the negative impact of changing user behaviour over time on classification accuracy can be best alleviated by periodically replacing older training samples with new randomly chosen samples.
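As a rough illustration of how a classifier can be built on Chebyshev's inequality, which bounds P(|X - mu| >= k*sigma) by 1/k^2 for any distribution with finite variance, the sketch below accepts a gesture sample when enough of its features lie within k standard deviations of the owner's training mean. It is not the classifier from the paper: the function names, the threshold k, and the per-feature voting rule (min_fraction) are all assumptions made for this example.

```python
from statistics import mean, pstdev


def fit_profile(samples):
    """Per-feature (mean, std) computed from the owner's training gestures."""
    columns = list(zip(*samples))
    return [(mean(col), pstdev(col)) for col in columns]


def chebyshev_bound(k):
    """Upper bound on P(|X - mu| >= k*sigma); distribution-free, requires k > 0."""
    return min(1.0, 1.0 / (k * k))


def accept(profile, sample, k=3.0, min_fraction=0.75):
    """Hypothetical decision rule: accept if enough features fall within k*sigma."""
    within = sum(
        1 for (mu, sigma), x in zip(profile, sample) if abs(x - mu) <= k * sigma
    )
    return within / len(profile) >= min_fraction
```

With k = 3, Chebyshev's inequality guarantees that a genuine feature value falls outside the accepted band with probability at most 1/9, whatever the underlying feature distribution is, which is why no distributional assumption is needed in this sketch.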
SI Sep 7, 2014
Paying for Likes? Understanding Facebook Like Fraud Using Honeypots
Emiliano De Cristofaro, Arik Friedman, Guillaume Jourjon et al.
Facebook pages offer an easy way to reach out to a very large audience as they can easily be promoted using Facebook's advertising platform. Recently, the number of likes of a Facebook page has become a measure of its popularity and profitability, and an underground market of services boosting page likes, aka like farms, has emerged. Some reports have suggested that like farms use a network of profiles that also like other pages to elude fraud protection algorithms; however, to the best of our knowledge, there has been no systematic analysis of Facebook pages' promotion methods. This paper presents a comparative measurement study of page likes garnered via Facebook ads and by a few like farms. We deploy a set of honeypot pages, promote them using both methods, and analyze garnered likes based on likers' demographic, temporal, and social characteristics. We highlight a few interesting findings, including that some farms seem to be operated by bots and do not really try to hide the nature of their operations, while others follow a stealthier approach, mimicking regular users' behavior.
IR Jun 11, 2014
Are 140 Characters Enough? A Large-Scale Linkability Study of Tweets
Mishari Almishari, Mohamed Ali Kaafar, Gene Tsudik et al.
Microblogging is a very popular Internet activity that informs and entertains great multitudes of people world-wide via quickly and scalably disseminated terse messages containing all kinds of newsworthy utterances. Even though microblogging is neither designed nor meant to emphasize privacy, numerous contributors hide behind pseudonyms and compartmentalize their different incarnations via multiple accounts within the same, or across multiple, site(s). Prior work has shown that stylometric analysis is a very powerful tool capable of linking product or service reviews and blogs that are produced by the same author when the number of authors is large. In this paper, we explore linkability of tweets. Our results, based on a very large corpus of tweets, clearly demonstrate that, at least for relatively active tweeters, linkability of tweets by the same author is easily attained even when the number of tweeters is large. We also show that our linkability results hold for a set of actual Twitter users who tweet from multiple accounts. This has some obvious privacy implications, both positive and negative.
CY Feb 14, 2014
Censorship in the Wild: Analyzing Internet Filtering in Syria
Abdelberi Chaabane, Terence Chen, Mathieu Cunche et al.
Internet censorship is enforced by numerous governments worldwide; however, due to the lack of publicly available information, as well as the inherent risks of performing active measurements, it is often hard for the research community to investigate censorship practices in the wild. Thus, the leak of 600GB worth of logs from 7 Blue Coat SG-9000 proxies, deployed in Syria to filter Internet traffic at a country scale, represents a unique opportunity to provide a detailed snapshot of a real-world censorship ecosystem. This paper presents the methodology and the results of a measurement analysis of the leaked Blue Coat logs, revealing a relatively stealthy, yet quite targeted, censorship. We find that traffic is filtered in several ways: using IP addresses and domain names to block subnets or websites, and keywords or categories to target specific content. We show that keyword-based censorship produces some collateral damage as many requests are blocked even if they do not relate to sensitive content. We also discover that Instant Messaging is heavily censored, while filtering of social media is limited to specific pages. Finally, we show that Syrian users try to evade censorship by using web/socks proxies, Tor, VPNs, and BitTorrent. To the best of our knowledge, our work provides the first analytical look into Internet filtering in Syria.
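The filtering mechanisms named in this abstract (IP subnets, domain names, keywords) can be pictured with a small sketch. Everything concrete in it is hypothetical: the subnet, domain, and keyword lists are placeholders rather than rules recovered from the leaked logs, the classify function is not the Blue Coat policy engine, and category-based filtering is left out.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical rule sets, chosen only for illustration.
BLOCKED_SUBNETS = [ipaddress.ip_network("203.0.113.0/24")]  # documentation range
BLOCKED_DOMAINS = {"blocked.example"}
BLOCKED_KEYWORDS = {"proxy"}


def classify(url, dest_ip):
    """Return which hypothetical rule (if any) would deny the request."""
    host = (urlsplit(url).hostname or "").lower()
    address = ipaddress.ip_address(dest_ip)
    if any(address in subnet for subnet in BLOCKED_SUBNETS):
        return "denied:ip"
    if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return "denied:domain"
    # Substring matching over the whole URL is what makes keyword rules blunt.
    if any(keyword in url.lower() for keyword in BLOCKED_KEYWORDS):
        return "denied:keyword"
    return "allowed"
```

Under these assumed rules, a request for a harmless help page such as http://docs.example/configure-proxy-settings is denied by the keyword rule, which illustrates the kind of collateral damage the abstract attributes to keyword-based filtering.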