9.5CGMay 22
Dynamic Query Modification for Binary Locality Sensitive HashingBen Claydon, Richard Connor, Alan Dearle
Our context of interest is how binary locality sensitive hash (LSH) functions can be used to solve the approximate near neighbour (ANN) problem, which seeks to find the k closest elements of some dataset X to some further point q presented as a query. Binary locality sensitive function families H are sets of functions each which accept a point and return a binary value. A function is locality sensitive if and only if the output of the function is more likely to be equal (a 'hash collision') if two close vectors are used as input than if two far vectors are used. A data structure can be built by generating binary hash codes for each member of X, which are generated by drawing and applying one or more functions from H. When q is presented as a query, the same set of functions is applied to it and those elements of X with equal binary hash codes are retrieved. In this paper we introduce dynamic query modification. This process changes q at query time to form a new value c, which by theoretical and experimental analysis we prove has two significant advantages. Firstly, the hash output of c collides with near neighbours with a greater probability than q. Secondly, we show there is little chance of c failing to collide with any near neighbours; a property which we demonstrate is not true for q. To demonstrate the efficacy of the technique, we define a novel structure MQ-Forest, a modified version of RP-Forest. Both are binary LSH-based ANN mechanisms, but MQ-Forest dynamically estimates a value for c during the query process. We show that MQ-Forest reduces both build and query times by up to 40% when measured over several large, high-dimensional benchmark datasets.
IRFeb 22, 2023
nSimplex Zen: A Novel Dimensionality Reduction for Euclidean and Hilbert SpacesRichard Connor, Lucia Vadicamo
Dimensionality reduction techniques map values from a high dimensional space to one with a lower dimension. The result is a space which requires less physical memory and has a faster distance calculation. These techniques are widely used where required properties of the reduced-dimension space give an acceptable accuracy with respect to the original space. Many such transforms have been described. They have been classified in two main groups: linear and topological. Linear methods such as Principal Component Analysis (PCA) and Random Projection (RP) define matrix-based transforms into a lower dimension of Euclidean space. Topological methods such as Multidimensional Scaling (MDS) attempt to preserve higher-level aspects such as the nearest-neighbour relation, and some may be applied to non-Euclidean spaces. Here, we introduce nSimplex Zen, a novel topological method of reducing dimensionality. Like MDS, it relies only upon pairwise distances measured in the original space. The use of distances, rather than coordinates, allows the technique to be applied to both Euclidean and other Hilbert spaces, including those governed by Cosine, Jensen-Shannon and Quadratic Form distances. We show that in almost all cases, due to geometric properties of high-dimensional spaces, our new technique gives better properties than others, especially with reduction to very low dimensions.
LGMay 31, 2025
Ultra-Quantisation: Efficient Embedding Search via 1.58-bit EncodingsRichard Connor, Alan Dearle, Ben Claydon
Many modern search domains comprise high-dimensional vectors of floating point numbers derived from neural networks, in the form of embeddings. Typical embeddings range in size from hundreds to thousands of dimensions, making the size of the embeddings, and the speed of comparison, a significant issue. Quantisation is a class of mechanism which replaces the floating point values with a smaller representation, for example a short integer. This gives an approximation of the embedding space in return for a smaller data representation and a faster comparison function. Here we take this idea almost to its extreme: we show how vectors of arbitrary-precision floating point values can be replaced by vectors whose elements are drawn from the set {-1,0,1}. This yields very significant savings in space and metric evaluation cost, while maintaining a strong correlation for similarity measurements. This is achieved by way of a class of convex polytopes which exist in the high-dimensional space. In this article we give an outline description of these objects, and show how they can be used for the basis of such radical quantisation while maintaining a surprising degree of accuracy.
IRJul 26, 2017
High-Dimensional Simplexes for Supermetric SearchRichard Connor, Lucia Vadicamo, Fausto Rabitti
In 1953, Blumenthal showed that every semi-metric space that is isometrically embeddable in a Hilbert space has the n-point property; we have previously called such spaces supermetric spaces. Although this is a strictly stronger property than triangle inequality, it is nonetheless closely related and many useful metric spaces possess it. These include Euclidean, Cosine and Jensen-Shannon spaces of any dimension. A simple corollary of the n-point property is that, for any (n+1) objects sampled from the space, there exists an n-dimensional simplex in Euclidean space whose edge lengths correspond to the distances among the objects. We show how the construction of such simplexes in higher dimensions can be used to give arbitrarily tight lower and upper bounds on distances within the original space. This allows the construction of an n-dimensional Euclidean space, from which lower and upper bounds of the original space can be calculated, and which is itself an indexable space with the n-point property. For similarity search, the engineering tradeoffs are good: we show significant reductions in data size and metric cost with little loss of accuracy, leading to a significant overall improvement in search performance.
IRJul 26, 2017
Supermetric SearchRichard Connor, Lucia Vadicamo, Franco Alberto Cardillo et al.
Metric search is concerned with the efficient evaluation of queries in metric spaces. In general,a large space of objects is arranged in such a way that, when a further object is presented as a query, those objects most similar to the query can be efficiently found. Most mechanisms rely upon the triangle inequality property of the metric governing the space. The triangle inequality property is equivalent to a finite embedding property, which states that any three points of the space can be isometrically embedded in two-dimensional Euclidean space. In this paper, we examine a class of semimetric space which is finitely four-embeddable in three-dimensional Euclidean space. In mathematics this property has been extensively studied and is generally known as the four-point property. All spaces with the four-point property are metric spaces, but they also have some stronger geometric guarantees. We coin the term supermetric space as, in terms of metric search, they are significantly more tractable. Supermetric spaces include all those governed by Euclidean, Cosine, Jensen-Shannon and Triangular distances, and are thus commonly used within many domains. In previous work we have given a generic mathematical basis for the supermetric property and shown how it can improve indexing performance for a given exact search structure. Here we present a full investigation into its use within a variety of different hyperplane partition indexing structures, and go on to show some more of its flexibility by examining a search structure whose partition and exclusion conditions are tailored, at each node, to suit the individual reference points and data set present there. Among the results given, we show a new best performance for exact search using a well-known benchmark.
IRApr 28, 2016
Hilbert Exclusion: Improved Metric Search through Finite Isometric EmbeddingsRichard Connor, Franco Alberto Cardillo, Lucia Vadicamo et al.
Most research into similarity search in metric spaces relies upon the triangle inequality property. This property allows the space to be arranged according to relative distances to avoid searching some subspaces. We show that many common metric spaces, notably including those using Euclidean and Jensen-Shannon distances, also have a stronger property, sometimes called the four-point property: in essence, these spaces allow an isometric embedding of any four points in three-dimensional Euclidean space, as well as any three points in two-dimensional Euclidean space. In fact, we show that any space which is isometrically embeddable in Hilbert space has the stronger property. This property gives stronger geometric guarantees, and one in particular, which we name the Hilbert Exclusion property, allows any indexing mechanism which uses hyperplane partitioning to perform better. One outcome of this observation is that a number of state-of-the-art indexing mechanisms over high dimensional spaces can be easily extended to give a significant increase in performance; furthermore, the improvement given is greater in higher dimensions. This therefore leads to a significant improvement in the cost of metric search in these spaces.