52.9DBJun 3Code
Indexicon: A Spatial Indexing LibraryPanagiotis Simatis, Panagiotis Bouros, Nikos Mamoulis
Spatial indexing is foundational to Geographic Information Systems (GIS) and multi-dimensional data management, yet the current open-source landscape poses a significant barrier to research that employs or benchmarks spatial access methods. We observe that most of the existing open-source libraries include a single index. Some are hindered by complex dependencies, missing critical functionalities, inconsistent APIs, and rigid constraints regarding the support of spatial data types. To address this issue, we introduce Indexicon: a unified, highly portable, extendable, open-source spatial indexing library, designed specifically for rapid integration and ease of modification of main-memory spatial access methods. Indexicon provides a comprehensive suite of popular tree-based spatial access methods, including the R-tree, Quad-tree variants, and the KD-tree. Each structure is meticulously implemented as a self-contained, single-file, header-only C++ template with zero external dependencies beyond the standard library. Crucially, every index features a uniform interface, natively supporting bulk-loading, dynamic insertions/deletions, range queries, $k$-nearest neighbor ($k$NN) search, and structural statistics tracking. We also present an extensive performance evaluation of Indexicon against well-established and widely used implementations of these structures (including Boost Geometry, PCL, and Nanoflann) across six real-world geographic datasets. Our results demonstrate that Indexicon's lightweight design matches or outperforms existing state-of-the-art implementations while offering unmatched architectural flexibility. To foster reproducible spatial research, we open-source the complete library alongside our datasets and query workloads.
57.9IRJun 3
ANN Search: Recall What MattersDimitris Dimitropoulos, Nikos Mamoulis
Approximate nearest neighbor (ANN) search has become a core primitive in information retrieval and modern machine learning tasks, from classification to retrieval-augmented generation. The community evaluates and tunes ANN algorithms primarily on their throughput at a given Recall@k, the fraction of true exact neighbors retrieved. We argue that what really matters in ANN search is the quality of the retrieved results and not their overlap with the true kNN set. We show that using Recall@k to assess retrieval quality forces unnecessary computational overhead and investigate replacing it by 1/Ratio@k, the inverse approximation ratio. 1/Ratio@k evaluates the differences between the distances of the retrieved and true neighbors. It is judge-free, hyperparameter-free, and computable from standard ANN benchmark inputs alone. We benchmark state-of-the-art ANN algorithms across diverse datasets spanning a wide range of intrinsic dimensionalities, evaluating the two metrics comprehensively across efficiency, downstream classification, and retrieval-augmented generation. On the efficiency axis, optimizing for 1/Ratio@k reaches operational quality thresholds at a substantially lower computational cost than Recall@k. In downstream tasks, performance indicators (label precision, semantic similarity, BERTScore, and LLM-graded quality) remain highly stable even when Recall@k drops significantly. The inverse approximation ratio, on the other hand, closely mirrors this stability, tracking true utility much better than Recall@k. Ultimately, while Recall@k overstates the true cost of approximation, 1/Ratio@k offers a more accurate, deployable proxy for actual ANN quality.
76.8DBMay 14
Toward Temporal Attribution Analytics in DataflowsChrysanthi Kosyfaki, Ruiyuan Zhang, Nikos Mamoulis et al.
Data provenance (the process of determining the origin and derivation of data outputs) has applications across multiple domains including explaining database query results and auditing scientific workflows. Despite decades of research, provenance tracing remains challenging due to its high computational cost and storage requirements. In streaming systems such as Apache Flink, fine-grained provenance graphs can grow super-linearly with data volume, posing significant scalability challenges. We define temporal attribution, a new lightweight form of provenance, appropriate for certain tasks, such as monitoring dependencies between system components over time quantitatively. Temporal attribution enables time-focused analysis that does not require fine-grained, tuple-level dependency meta-data. Inspired by volume-based provenance tracking in Temporal Interaction Networks (TINs), we demonstrate TINs' applicability in succinctly modeling quantified data exchanges between dataflow operators in stream data processing systems and in processing workflows, in general, over time. We classify data into discrete and liquid types, define five temporal provenance query types, and propose a state-based indexing approach. Our vision outlines research directions toward making this new form of temporal attribution a practical tool for large-scale dataflow analytics.
LGSep 7, 2020Code
Fast and Secure Distributed Nonnegative Matrix FactorizationYuqiu Qian, Conghui Tan, Danhao Ding et al.
Nonnegative matrix factorization (NMF) has been successfully applied in several data mining tasks. Recently, there is an increasing interest in the acceleration of NMF, due to its high cost on large matrices. On the other hand, the privacy issue of NMF over federated data is worthy of attention, since NMF is prevalently applied in image and text analysis which may involve leveraging privacy data (e.g, medical image and record) across several parties (e.g., hospitals). In this paper, we study the acceleration and security problems of distributed NMF. Firstly, we propose a distributed sketched alternating nonnegative least squares (DSANLS) framework for NMF, which utilizes a matrix sketching technique to reduce the size of nonnegative least squares subproblems with a convergence guarantee. For the second problem, we show that DSANLS with modification can be adapted to the security setting, but only for one or limited iterations. Consequently, we propose four efficient distributed NMF methods in both synchronous and asynchronous settings with a security guarantee. We conduct extensive experiments on several real datasets to show the superiority of our proposed methods. The implementation of our methods is available at https://github.com/qianyuqiu79/DSANLS.
15.9DBMay 5
In-memory Multidimensional Indexing Using the skd-treeAchilleas Michalopoulos, Dimitrios Tsitsigkos, Nikos Mamoulis
In this paper, we revisit the problem of indexing multi-dimensional data in memory for the efficient support of multi-dimensional range queries and nearest neighbor queries. This is a classic problem in main-memory databases, where there is a need for indexing multiple columns simultaneously. Established data structures include the R-tree, kd-tree, quad-tree, and grid-based partitioning. More recently, multi-dimensional learned indexes have also been proposed to address this problem. We propose slicing kd-tree (skd-tree), a variant of the kd-tree, where each node partitions the space of its subtree into multiple slices across a single splitting dimension. By compressing the splitters of the partitions and with the help of data-parallelism, we (i) radically reduce the number of levels of the tree and (ii) limit the number of computations required for multi-dimensional range and proximity queries. The nodes of the skd-tree resemble the nodes of a main-memory B+-tree, however, a different dimension is used at each level. Our novel range and kNN algorithms on the skd-tree apply only a small constant number of SIMD instructions at each node during tree traversal. Our contributions also include a novel top-down construction algorithm, different types of inner and leaf nodes that warrant tree balancing, and a novel update algorithm. Our skd-tree achieves strong performance compared to existing methods, according to our experimental evaluation on real and synthetic datasets.
LGDec 18, 2020
Leveraging Meta-path Contexts for Classification in Heterogeneous Information NetworksXiang Li, Danhao Ding, Ben Kao et al.
A heterogeneous information network (HIN) has as vertices objects of different types and as edges the relations between objects, which are also of various types. We study the problem of classifying objects in HINs. Most existing methods perform poorly when given scarce labeled objects as training sets, and methods that improve classification accuracy under such scenarios are often computationally expensive. To address these problems, we propose ConCH, a graph neural network model. ConCH formulates the classification problem as a multi-task learning problem that combines semi-supervised learning with self-supervised learning to learn from both labeled and unlabeled data. ConCH employs meta-paths, which are sequences of object types that capture semantic relationships between objects. ConCH co-derives object embeddings and context embeddings via graph convolution. It also uses the attention mechanism to fuse such embeddings. We conduct extensive experiments to evaluate the performance of ConCH against other 15 classification methods. Our results show that ConCH is an effective and efficient method for HIN classification.
DBNov 4, 2019
A General Early-Stopping Module for Crowdsourced RankingCaihua Shan, Leong Hou U, Nikos Mamoulis et al.
Crowdsourcing can be used to determine a total order for an object set (e.g., the top-10 NBA players) based on crowd opinions. This ranking problem is often decomposed into a set of microtasks (e.g., pairwise comparisons). These microtasks are passed to a large number of workers and their answers are aggregated to infer the ranking. The number of microtasks depends on the budget allocated for the problem. Intuitively, the higher the number of microtask answers, the more accurate the ranking becomes. However, it is often hard to decide the budget required for an accurate ranking. We study how a ranking process can be terminated early, and yet achieve a high-quality ranking and great savings in the budget. We use statistical tools to estimate the quality of the ranking result at any stage of the crowdsourcing process and terminate the process as soon as the desired quality is achieved. Our proposed early-stopping module can be seamlessly integrated with most existing inference algorithms and task assignment methods. We conduct extensive experiments and show that our early-stopping module is better than other existing general stopping criteria. We also implement a prototype system to demonstrate the usability and effectiveness of our approach in practice.
LGNov 4, 2019
An End-to-End Deep RL Framework for Task Arrangement in Crowdsourcing PlatformsCaihua Shan, Nikos Mamoulis, Reynold Cheng et al.
In this paper, we propose a Deep Reinforcement Learning (RL) framework for task arrangement, which is a critical problem for the success of crowdsourcing platforms. Previous works conduct the personalized recommendation of tasks to workers via supervised learning methods. However, the majority of them only consider the benefit of either workers or requesters independently. In addition, they cannot handle the dynamic environment and may produce sub-optimal results. To address these issues, we utilize Deep Q-Network (DQN), an RL-based method combined with a neural network to estimate the expected long-term return of recommending a task. DQN inherently considers the immediate and future reward simultaneously and can be updated in real-time to deal with evolving data and dynamic changes. Furthermore, we design two DQNs that capture the benefit of both workers and requesters and maximize the profit of the platform. To learn value functions in DQN effectively, we also propose novel state representations, carefully design the computation of Q values, and predict transition probabilities and future states. Experiments on synthetic and real datasets demonstrate the superior performance of our framework.
AIJan 19, 2017
Heterogeneous Information Network Embedding for Meta Path based ProximityZhipeng Huang, Nikos Mamoulis
A network embedding is a representation of a large graph in a low-dimensional space, where vertices are modeled as vectors. The objective of a good embedding is to preserve the proximity between vertices in the original graph. This way, typical search and mining methods can be applied in the embedded space with the help of off-the-shelf multidimensional indexing approaches. Existing network embedding techniques focus on homogeneous networks, where all vertices are considered to belong to a single class.