Tom van Goethem

4papers

1,319citations

Novelty51%

AI Score27

Ranked #158,526 of 201,326 authors (top 79%)#4,556 in CR (top 62%)

4 Papers

CRFeb 18, 2021

The CNAME of the Game: Large-scale Analysis of DNS-based Tracking Evasion

Yana Dimova, Gunes Acar, Lukasz Olejnik et al.

Online tracking is a whack-a-mole game between trackers who build and monetize behavioral user profiles through intrusive data collection, and anti-tracking mechanisms, deployed as a browser extension, built-in to the browser, or as a DNS resolver. As a response to pervasive and opaque online tracking, more and more users adopt anti-tracking tools to preserve their privacy. Consequently, as the information that trackers can gather on users is being curbed, some trackers are looking for ways to evade these tracking countermeasures. In this paper we report on a large-scale longitudinal evaluation of an anti-tracking evasion scheme that leverages CNAME records to include tracker resources in a same-site context, effectively bypassing anti-tracking measures that use fixed hostname-based block lists. Using historical HTTP Archive data we find that this tracking scheme is rapidly gaining traction, especially among high-traffic websites. Furthermore, we report on several privacy and security issues inherent to the technical setup of CNAME-based tracking that we detected through a combination of automated and manual analyses. We find that some trackers are using the technique against the Safari browser, which is known to include strict anti-tracking configurations. Our findings show that websites using CNAME trackers must take extra precautions to avoid leaking sensitive information to third parties.

CRJun 4, 2018

Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation

Victor Le Pochat, Tom Van Goethem, Samaneh Tajalizadehkhoob et al.

In order to evaluate the prevalence of security and privacy practices on a representative sample of the Web, researchers rely on website popularity rankings such as the Alexa list. While the validity and representativeness of these rankings are rarely questioned, our findings show the contrary: we show for four main rankings how their inherent properties (similarity, stability, representativeness, responsiveness and benignness) affect their composition and therefore potentially skew the conclusions made in studies. Moreover, we find that it is trivial for an adversary to manipulate the composition of these lists. We are the first to empirically validate that the ranks of domains in each of the lists are easily altered, in the case of Alexa through as little as a single HTTP request. This allows adversaries to manipulate rankings on a large scale and insert malicious domains into whitelists or bend the outcome of research studies to their will. To overcome the limitations of such rankings, we propose improvements to reduce the fluctuations in list composition and guarantee better defenses against manipulation. To allow the research community to work with reliable and reproducible rankings, we provide Tranco, an improved ranking that we offer through an online service available at https://tranco-list.eu.

CRAug 22, 2017

Herding Vulnerable Cats: A Statistical Approach to Disentangle Joint Responsibility for Web Security in Shared Hosting

Samaneh Tajalizadehkhoob, Tom van Goethem, Maciej Korczyński et al.

Hosting providers play a key role in fighting web compromise, but their ability to prevent abuse is constrained by the security practices of their own customers. {\em Shared} hosting, offers a unique perspective since customers operate under restricted privileges and providers retain more control over configurations. We present the first empirical analysis of the distribution of web security features and software patching practices in shared hosting providers, the influence of providers on these security practices, and their impact on web compromise rates. We construct provider-level features on the global market for shared hosting -- containing 1,259 providers -- by gathering indicators from 442,684 domains. Exploratory factor analysis of 15 indicators identifies four main latent factors that capture security efforts: content security, webmaster security, web infrastructure security and web application security. We confirm, via a fixed-effect regression model, that providers exert significant influence over the latter two factors, which are both related to the software stack in their hosting environment. Finally, by means of GLM regression analysis of these factors on phishing and malware abuse, we show that the four security and software patching factors explain between 10\% and 19\% of the variance in abuse at providers, after controlling for size. For web-application security for instance, we found that when a provider moves from the bottom 10\% to the best-performing 10\%, it would experience 4 times fewer phishing incidents. We show that providers have influence over patch levels--even higher in the stack, where CMSes can run as client-side software--and that this influence is tied to a substantial reduction in abuse levels.

CRAug 21, 2017

Automated Website Fingerprinting through Deep Learning

Vera Rimmer, Davy Preuveneers, Marc Juarez et al.

Several studies have shown that the network traffic that is generated by a visit to a website over Tor reveals information specific to the website through the timing and sizes of network packets. By capturing traffic traces between users and their Tor entry guard, a network eavesdropper can leverage this meta-data to reveal which website Tor users are visiting. The success of such attacks heavily depends on the particular set of traffic features that are used to construct the fingerprint. Typically, these features are manually engineered and, as such, any change introduced to the Tor network can render these carefully constructed features ineffective. In this paper, we show that an adversary can automate the feature engineering process, and thus automatically deanonymize Tor traffic by applying our novel method based on deep learning. We collect a dataset comprised of more than three million network traces, which is the largest dataset of web traffic ever used for website fingerprinting, and find that the performance achieved by our deep learning approaches is comparable to known methods which include various research efforts spanning over multiple years. The obtained success rate exceeds 96% for a closed world of 100 websites and 94% for our biggest closed world of 900 classes. In our open world evaluation, the most performant deep learning model is 2% more accurate than the state-of-the-art attack. Furthermore, we show that the implicit features automatically learned by our approach are far more resilient to dynamic changes of web content over time. We conclude that the ability to automatically construct the most relevant traffic features and perform accurate traffic recognition makes our deep learning based approach an efficient, flexible and robust technique for website fingerprinting.