86.6 NI May 29
Where's Waldo Library? Using Reverse IP Geolocation to Identify Library IPs
Nishant Acharya, Anyu Yang, Humaira Fasih Ahmed Hashmi et al.
Community anchor institutions (CAIs), such as libraries, schools, and community centers, are critical for providing Internet access to un- or under-served individuals and communities. Because many of these institutions are themselves under-provisioned, analyzing the reliability and quality of their Internet service is important. Doing so at scale requires knowing the IP addresses of these institutions so that broadband measurement and policy evaluation can occur. Unfortunately, these IPs are not systematically documented. As a first step towards widespread, scalable evaluation of CAI Internet connectivity, this paper presents Reverse IP Geolocation (RG), a new framework to infer IP addresses from physical address data. A key insight is that CAI street addresses are publicly known, which allows us to identify a candidate set of IPs from commercial geolocation that are likely serving the location associated with a CAI. In this paper, we focus on US public libraries, which offer both geographic diversity across thousands of locations and some publicly available institutional records (e.g., WHOIS registrations) that enable systematic validation of our approach. Our approach offers a novel integration of IP geolocation databases, DNS PTR records, WHOIS registrations, broadband provider data, and active measurements to identify IPs likely assigned to libraries and validate them. Based on evaluations, our approach can map a library to its IP prefix approximately half of the time, with coverage across all US states, as well as urban and rural areas. Our results highlight the feasibility of mapping CAI presence in IP space and offer a foundation for large-scale, remote broadband infrastructure evaluation.
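The candidate-selection step described in this abstract (start from a known street address, then pull nearby IPs from a geolocation database) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the `GeoRecord` type, the `candidate_prefixes` helper, the 1 km radius, and the sample records are all assumptions, and the prefixes are reserved documentation ranges. The abstract's further validation steps (DNS PTR, WHOIS, provider data, active measurements) are not modeled here.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0


@dataclass(frozen=True)
class GeoRecord:
    """One row of a hypothetical IP geolocation database."""
    prefix: str
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def candidate_prefixes(site_lat, site_lon, records, radius_km=1.0):
    """Return prefixes geolocated within radius_km of the site, nearest first."""
    hits = [(haversine_km(site_lat, site_lon, r.lat, r.lon), r.prefix) for r in records]
    return [prefix for dist, prefix in sorted(hits) if dist <= radius_km]


# Hypothetical data: a library geocoded to (40.0, -75.0) and three database rows.
records = [
    GeoRecord("192.0.2.0/24", 40.001, -75.001),
    GeoRecord("198.51.100.0/24", 40.5, -75.0),
    GeoRecord("203.0.113.0/24", 40.005, -75.0),
]
print(candidate_prefixes(40.0, -75.0, records))
```

The radius is the main design choice in such a sketch: a tight radius yields few false candidates but misses libraries whose IPs are geolocated coarsely, which is presumably why the abstract pairs geolocation with independent validation sources.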
50.9 NI May 29
Not All Roads Lead to Rome: How VPN Selection Alters What We Measure and Infer about Web Infrastructure
Sachin Kumar Singh, Robert Ricci, Alexander Gamero-Garrido
Web-measurement studies treat commercial VPNs as interchangeable vantage points within a country, assuming that any VPN in a particular country is as good as any other. We show that this assumption does not hold: the same country measured through different VPN providers yields materially different conclusions about where endpoints sit, who hosts them, and which physical replicas serve them. Using large-scale browser-based measurements across fourteen countries and four major VPN providers, complemented by targeted DNS and replica-selection probes, we examine sources of this variability across three layers of the VPN-to-endpoint path: vantage identity, name resolution, and replica selection. We find that the variability is driven primarily by layers below the client: commercial VPN providers operate their own in-country DNS infrastructure, often intercepting queries regardless of client configuration; CDNs steer on the exit network, sending identical queries to different replicas; and peering paths route identical DNS answers to different physical facilities. We distill these findings into a set of reporting practices for VPN-based Web measurement.
85.9 NI May 29
Stratifying the Digital Divide: Analysis of Socio-Economic Influences on Internet Performance
Shivani Kalamadi, Aditya Bej, Sachin Kumar Singh et al.
Despite numerous technological advancements, the digital divide remains a pressing issue affecting millions worldwide. We present a framework for diagnosing internet inequality at the Census Block Group level by pairing approximately 170 million crowdsourced Ookla speed tests (2021–2025) with U.S. Census demographics across six metropolitan regions. After quantifying and correcting for sampling bias, we use Random Forest regression with permutation importance to identify the socio-economic drivers of download speed, upload speed, and latency. Population density dominates all three metrics at the regional level, but this dominance is an artifact of scale: once areas are stratified into density bins, its influence vanishes in medium- and higher-density neighborhoods, revealing that socio-economic conditions are the true differentiators of internet quality in most urban settings. After controlling for density, income and racial composition emerge as the primary drivers, with income consistently dictating upload speed and racial composition proving to be a stronger predictor of download speed than either income or education. Our findings demonstrate that internet inequality is locally configured: no single national narrative explains it, and effective policy demands region-specific intervention.
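Permutation importance, which this abstract applies to a Random Forest, is model-agnostic: shuffle one feature column, re-score the model, and record how much the error grows. The sketch below shows the idea in plain Python under loud assumptions: the `predict` function is a hand-written stand-in rather than a trained Random Forest, the data are synthetic, and mean squared error is an assumed scoring choice that the abstract does not specify.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled.

    predict: callable mapping one feature row (a list) to a prediction.
    X: list of feature rows (lists); y: list of targets.
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Synthetic example: the target depends on feature 0 only; feature 1 is ignored.
X = [[float(i), float((i * 7) % 20)] for i in range(20)]
y = [2.0 * row[0] for row in X]
stand_in_model = lambda row: 2.0 * row[0]  # hypothetical model, not a Random Forest
print(permutation_importance(stand_in_model, X, y))
```

To mirror the stratification the abstract describes, one would group rows by density bin and call `permutation_importance` once per group, so that a feature's importance is measured within comparable neighborhoods instead of across the whole region.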
59.1 CR Apr 30
SST-Guard: Detecting and Characterizing Server-Side Google Analytics in the Wild
Muhammad Jazlan, Alexander Gamero-Garrido, Zubair Shafiq et al.
As web browsers increasingly restrict client-side tracking, the web tracking ecosystem is shifting from client-side to server-side tracking (SST). In SST, the browser sends tracking requests to an intermediate endpoint, which then forwards them to the tracker's endpoint, eliminating direct client-to-tracker requests. As a result, existing tracking protections that block requests to known tracker endpoints are rendered ineffective. In this paper, we investigate server-side implementation of Google Analytics, the most widely deployed third-party tracking service on the web today. We also present SST-Guard, a multi-modal, browser-based system for detecting and blocking server-side Google Analytics (sGA). Our key insight is that even when the tracker's endpoints change, sGA must necessarily still collect and share the same semantic information as client-side Google Analytics (e.g., identifiers, event metadata). Therefore, rather than detecting requests to known Google Analytics endpoints, SST-Guard aims to detect underlying artifacts of collection and sharing of these semantic values to any arbitrary endpoint. Operationalizing this insight is challenging because real-world sGA deployments commonly customize endpoints and obfuscate URLs/payloads. SST-Guard addresses this challenge using a value-template approach that employs regular expressions to match semantic value patterns across multiple modalities: network requests, cookies, and the window object. We validate SST-Guard on Tranco top-10k websites, detecting 4.02% (403) sGA domains with over 93% accuracy across three modalities, with the network request classifier demonstrating the highest accuracy (99.8%). By deploying SST-Guard in the wild, we find 4.21% (6,314) of Tranco top-150k websites using sGA.
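The value-template idea in this abstract (match the shape of the values, not the endpoint or parameter names) might look roughly like the sketch below for the network-request modality. The regular expressions, the `VALUE_TEMPLATES` table, and the `match_templates` helper are illustrative assumptions loosely modeled on commonly seen Google Analytics identifier shapes; they are not SST-Guard's actual templates, and the example URL is made up.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Hypothetical templates keyed by the semantic value they are meant to capture.
VALUE_TEMPLATES = {
    "client_id": re.compile(r"\d{6,10}\.\d{10}"),         # random part "." Unix timestamp
    "measurement_id": re.compile(r"G-[A-Z0-9]{6,12}"),    # GA4-style property identifier
    "ga_cookie": re.compile(r"GA1\.\d\.\d{6,10}\.\d{10}"),  # _ga cookie-style value
}


def match_templates(url):
    """Return sorted template names matched by any query-string value in url.

    Parameter names are ignored on purpose: a deployment may rename them,
    while the values still have to carry the same information.
    """
    values = [value for _, value in parse_qsl(urlsplit(url).query)]
    matched = {
        name
        for name, pattern in VALUE_TEMPLATES.items()
        for value in values
        if pattern.fullmatch(value)
    }
    return sorted(matched)


# A request to a first-party endpoint with renamed parameters (made-up URL).
url = "https://metrics.example.com/collect?a=G-ABC1234567&b=1234567890.1700000000&c=page_view"
print(match_templates(url))
```

Because matching is done on values only, the same table could in principle be reused against cookie values or properties of the window object, which is how a single set of templates could cover the three modalities the abstract lists.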