48.9 | CR | Apr 19 | Code
Original Sin of npm: A Study on Vulnerability Propagation in JavaScript Dependency Networks
Michael Robinson, Sajal Halder, Muhammad Ejaz Ahmed et al.
Understanding vulnerability propagation is essential for assessing how vulnerabilities spread across components of a software package. This supports more accurate impact analysis and enhances threat detection and mitigation. In this paper, we investigate how a small number of vulnerable JavaScript packages contribute to the creation of a disproportionately large number of vulnerable packages. This paper presents insights from 1,515 reported vulnerabilities gathered from a custom-built vulnerability database containing 1,077,946 JavaScript packages sourced from 'npm-follower' and their associated dependency networks. Dependency networks were constructed using the deps.dev API, with vulnerabilities identified by parsing package names and version numbers through the Google Open Source Vulnerability API. Our findings reveal that 61.30% (660,748) of packages are reliant on one or more dependency packages, and 21.60% (232,836) of total packages have at least one known vulnerability throughout their dependency networks -- of which the largest share (42%) are of High severity. We also found that it takes, on average, approximately 4 years and 11 months to fix a vulnerable package from when the first vulnerable version is published on npm -- although vulnerabilities are published approximately 19 days after a fix is available. Finally, we observe a high concentration of frequently present vulnerabilities throughout dependency networks, with the top-7 most frequent vulnerabilities accounting for 25% of vulnerability cases and the top-23 most frequent accounting for 50%. Based on these findings, we propose recommendations for developers and package managers to mitigate the threat and occurrence of vulnerabilities within the npm dependency network and the broader software repository community.
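To make the idea of propagation concrete, the following is a minimal sketch of how a single vulnerable package can affect everything that depends on it, directly or transitively. It is not the authors' pipeline: the package names, the `dependencies` mapping, and the function `affected_packages` are all hypothetical, and no call to deps.dev or the vulnerability API is made.

```python
from collections import deque


def affected_packages(dependencies, vulnerable):
    """Return every package whose dependency network contains a vulnerable package.

    `dependencies` maps a package name to the list of packages it depends on
    directly. Both arguments are toy, hypothetical data.
    """
    # Reverse the edges: dependency -> packages that depend on it directly.
    dependents = {}
    for pkg, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(pkg)

    # Breadth-first walk upward from each vulnerable package.
    affected = set(vulnerable)
    queue = deque(vulnerable)
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical six-package ecosystem with one vulnerable leaf package.
deps = {
    "app-a": ["lib-x", "lib-y"],
    "app-b": ["lib-y"],
    "app-c": [],
    "lib-x": ["core-z"],
    "lib-y": [],
    "core-z": [],
}
hit = affected_packages(deps, {"core-z"})
share = len(hit) / len(deps)
print(sorted(hit), share)
```

In this toy network one vulnerable package out of six ends up touching half of the ecosystem, which is the kind of disproportion the abstract describes at much larger scale.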
DG | Oct 11, 2024 | Code
The structure of the token space for large language models
Michael Robinson, Sourya Dey, Shauna Sweet
Large language models encode the correlational structure present in natural language by fitting segments of utterances (tokens) into a high dimensional ambient latent space upon which the models then operate. We assert that in order to develop a foundational, first-principles understanding of the behavior and limitations of large language models, it is crucial to understand the topological and geometric structure of this token subspace. In this article, we present estimators for the dimension and Ricci scalar curvature of the token subspace, and apply them to three open source large language models of moderate size: GPT2, LLEMMA7B, and MISTRAL7B. In all three models, using these measurements, we find that the token subspace is not a manifold, but is instead a stratified manifold, where on each of the individual strata, the Ricci curvature is significantly negative. We additionally find that the dimension and curvature correlate with generative fluency of the models, which suggests that these findings have implications for model behavior.
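The abstract does not spell out its estimators, so the following is only a generic illustration of one common way to estimate local dimension: count the points within radius r of a chosen point and fit the slope of log(count) against log(r), since the count grows roughly like r^d on a d-dimensional set. The function `local_dimension`, the synthetic grid, and the radii are assumptions made for this sketch and are not taken from the paper.

```python
import math


def local_dimension(points, center, radii):
    """Estimate dimension near `center` as the slope of log(count) vs log(radius)."""
    dists = [math.dist(p, center) for p in points]
    xs = [math.log(r) for r in radii]
    # Number of points within each radius (the center itself is always counted).
    ys = [math.log(sum(d <= r for d in dists)) for r in radii]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic data: a flat 2-dimensional grid sitting inside a 5-dimensional space.
points = [(i / 50, j / 50, 0.0, 0.0, 0.0) for i in range(51) for j in range(51)]
center = (0.5, 0.5, 0.0, 0.0, 0.0)
radii = [0.1 * 4 ** (k / 7) for k in range(8)]  # 0.1 up to 0.4, log-spaced

estimate = local_dimension(points, center, radii)
print(round(estimate, 2))
```

On this flat test set the slope comes out close to 2 even though the ambient space has five coordinates. A point where the slope changes with radius, or differs sharply from its neighbors, is the sort of irregularity that a manifold model would not predict.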
CL | Apr 1, 2025 | Code
Token embeddings violate the manifold hypothesis
Michael Robinson, Sourya Dey, Tony Chiang
A full understanding of the behavior of a large language model (LLM) requires an understanding of its input token space. If this space differs from our assumptions, our comprehension of and conclusions about the LLM will likely be flawed. We elucidate the structure of the token embeddings both empirically and theoretically. We present a novel statistical test assuming that the neighborhood around each token has a relatively flat and smooth structure as the null hypothesis. Failing to reject the null is uninformative, but rejecting it at a specific token $ψ$ implies an irregularity in the token subspace in a $ψ$-neighborhood, $B(ψ)$. The structure assumed in the null is a generalization of a manifold with boundary called a smooth fiber bundle (which can be split into two spatial regimes -- small and large radius), so we denote our new hypothesis test as the "fiber bundle hypothesis." By running our test over several open-source LLMs, each with unique token embeddings, we find that the null is frequently rejected, and so the evidence suggests that the token subspace is not a fiber bundle and hence also not a manifold. As a consequence of our findings, when an LLM is presented with two semantically equivalent prompts, if one prompt contains a token implicated by our test, the response to that prompt will likely exhibit less stability than the other.
PL | Feb 9, 2023
Unsupervised clustering of file dialects according to monotonic decompositions of mixtures
Michael Robinson, Tate Altman, Denley Lam et al.
This paper proposes an unsupervised classification method that partitions a set of files into non-overlapping dialects based upon their behaviors, determined by messages produced by a collection of programs that consume them. The pattern of messages can be used as the signature of a particular kind of behavior, with the understanding that some messages are likely to co-occur, while others are not. Patterns of messages can be used to classify files into dialects. A dialect is defined by a subset of messages, called the required messages. Once files are conditioned upon dialect and its required messages, the remaining messages are statistically independent. With this definition of dialect in hand, we present a greedy algorithm that deduces candidate dialects from a dataset consisting of a matrix of file-message data, demonstrate its performance on several file formats, and prove conditions under which it is optimal. We show that an analyst needs to consider fewer dialects than distinct message patterns, which reduces their cognitive load when studying a complex format.
CY | Jan 30, 2024
Prospects for inconsistency detection using large language models and sheaves
Steve Huntsman, Michael Robinson, Ludmilla Huntsman
We demonstrate that large language models can produce reasonable numerical ratings of the logical consistency of claims. We also outline a mathematical approach based on sheaf theory for lifting such ratings to hypertexts such as laws, jurisprudence, and social media and evaluating their consistency globally. This approach is a promising avenue to increasing consistency in and of government, as well as to combating mis- and disinformation and related ills.
DG | Mar 19, 2025
Probing the topology of the space of tokens with structured prompts
Michael Robinson, Sourya Dey, Taisa Kushner
This article presents a general and flexible method for prompting a large language model (LLM) to reveal its (hidden) token input embedding up to homeomorphism. Moreover, this article provides strong theoretical justification -- a mathematical proof for generic LLMs -- for why this method should be expected to work. With this method in hand, we demonstrate its effectiveness by recovering the token subspace of Llemma-7B. The results of this paper apply not only to LLMs but also to general nonlinear autoregressive processes.
SE | Mar 2, 2020
Topological Differential Testing
Kristopher Ambrose, Steve Huntsman, Michael Robinson et al.
We introduce topological differential testing (TDT), an approach to extracting the consensus behavior of a set of programs on a corpus of inputs. TDT uses the topological notion of a simplicial complex (and implicitly draws on richer topological notions such as sheaves and persistence) to determine inputs that cause inconsistent behavior and in turn reveal de facto input specifications. We gently introduce TDT with a toy example before detailing its application to understanding the PDF file format from the behavior of various parsers. Finally, we discuss theoretical details and other possible applications.
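As a rough illustration of the underlying idea, the sketch below performs plain differential testing: it runs several programs on the same inputs and flags the inputs on which they disagree. It deliberately omits the simplicial-complex machinery that TDT adds on top, and the three toy "parsers", the corpus, and the function `inconsistent_inputs` are hypothetical stand-ins rather than anything from the paper.

```python
def strict_int(text):
    """Accept only strings made entirely of digits."""
    return text.isdigit()


def lenient_int(text):
    """Accept digit strings with surrounding whitespace."""
    return text.strip().isdigit()


def signed_int(text):
    """Accept whatever Python's int() accepts."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def inconsistent_inputs(parsers, corpus):
    """Return inputs on which the parsers disagree, with each parser's verdict."""
    flagged = {}
    for item in corpus:
        verdicts = {name: parse(item) for name, parse in parsers.items()}
        if len(set(verdicts.values())) > 1:
            flagged[item] = verdicts
    return flagged


parsers = {"strict": strict_int, "lenient": lenient_int, "signed": signed_int}
corpus = ["42", " 42", "-7", "abc"]
flagged = inconsistent_inputs(parsers, corpus)
for item, verdicts in flagged.items():
    print(repr(item), verdicts)
```

The inputs on which every parser agrees suggest a de facto specification ("a plain run of digits is an integer"), while the flagged inputs mark the places where the implementations diverge.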
LO | Jan 14, 2020
The geometry of syntax and semantics for directed file transformations
Steve Huntsman, Michael Robinson
We introduce a conceptual framework that associates syntax and semantics with vertical and horizontal directions in principal bundles and related constructions. This notion of geometry corresponds to a mechanism for performing goal-directed file transformations such as "eliminate unsafe syntax" and suggests various engineering practices.
CY | Dec 31, 2018
Developing Cyber Buffer Zones
Michael Robinson, Kevin Jones, Helge Janicke et al.
The United Nations conducts peace operations around the world, aiming to maintain peace and security in conflict torn areas. Whilst early operations were largely successful, the changing nature of warfare and conflict has often left peace operations struggling to adapt. In this article, we make a contribution towards efforts to plan for the next evolution in both intra and inter-state conflict: cyber warfare. It is now widely accepted that cyber warfare will be a component of future conflicts, and much research has been devoted to how governments and militaries can prepare for and fight in this new domain [1]. Despite the vast amount of research relating to cyber warfare, there has been less discussion on its impact towards successful peace operations. This is a gap in knowledge that is important to address, since the restoration of peace following conflict of any kind is of global importance. It is however a complex topic requiring discussion across multiple domains. Input from the technical, political, governmental and societal domains are critical in forming the concept of cyber peacekeeping. Previous work on this topic has sought to define the concept of cyber peacekeeping [2, 3, 4]. We build upon this work by exploring the practicalities of starting up a cyber peacekeeping component and setting up a Cyber Buffer Zone (CBZ).
CR | Nov 1, 2017
Internet of Cloud: Security and Privacy issues
Allan Cook, Michael Robinson, Mohamed Amine Ferrag et al.
The synergy between the cloud and the IoT has emerged largely due to the cloud having attributes which directly benefit the IoT and enable its continued growth. IoT adopting Cloud services has brought new security challenges. In this book chapter, we pursue two main goals: 1) to analyse the different components of Cloud computing and the IoT and 2) to present security and privacy problems that these systems face. We thoroughly investigate current security and privacy preservation solutions that exist in this area, with an eye on the Industrial Internet of Things, discuss open issues and propose future directions.
CV | Jun 28, 2016
A Topological Lowpass Filter for Quasiperiodic Signals
Michael Robinson
This article presents a two-stage topological algorithm for recovering an estimate of a quasiperiodic function from a set of noisy measurements. The first stage of the algorithm is a topological phase estimator, which detects the quasiperiodic structure of the function without placing additional restrictions on the function. By respecting this phase estimate, the algorithm avoids creating distortion even when it uses a large number of samples for the estimate of the function.