Adrian Nilsson

DS
3papers
7citations
Novelty45%
AI Score20

3 Papers

DSNov 21, 2019
S-RASTER: Contraction Clustering for Evolving Data Streams

Gregor Ulm, Simon Smith, Adrian Nilsson et al.

Contraction Clustering (RASTER) is a single-pass algorithm for density-based clustering of 2D data. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters. It also exhibits good scalability in the presence of multiple CPU cores. RASTER exhibits very competitive performance compared to standard clustering algorithms, but at the cost of decreased precision. Yet, RASTER is limited to batch processing and unable to identify clusters that only exist temporarily. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. This algorithm retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step within a sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time. Like RASTER, S-RASTER trades off an often negligible amount of precision for speed. Our evaluation shows that competing algorithms are at least 50% slower. Furthermore, S-RASTER shows good qualitative results, based on standard metrics. It is very well suited to real-world scenarios where clustering does not happen continually but only periodically.

DSJul 8, 2019
Contraction Clustering (RASTER): A Very Fast Big Data Algorithm for Sequential and Parallel Density-Based Clustering in Linear Time, Constant Memory, and a Single Pass

Gregor Ulm, Simon Smith, Adrian Nilsson et al.

Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering algorithms are infeasible due to their high memory requirements and/or unfavorable runtime complexity. In contrast, Contraction Clustering (RASTER) is a single-pass algorithm for identifying density-based clusters with linear time complexity. Due to its favorable runtime and the fact that its memory requirements are constant, this algorithm is highly suitable for big data applications where the amount of data to be processed is huge. It consists of two steps: (1) a contraction step which projects objects onto tiles and (2) an agglomeration step which groups tiles into clusters. This algorithm is extremely fast in both sequential and parallel execution. Our quantitative evaluation shows that a sequential implementation of RASTER performs significantly better than various standard clustering algorithms. Furthermore, the parallel speedup is significant: on a contemporary workstation, an implementation in Rust processes a batch of 500 million points with 1 million clusters in less than 50 seconds on one core. With 8 cores, the algorithm is about four times faster.

DCMar 22, 2019
Facilitating Rapid Prototyping in the OODIDA Data Analytics Platform via Active-Code Replacement

Gregor Ulm, Simon Smith, Adrian Nilsson et al.

OODIDA (On-board/Off-board Distributed Data Analytics) is a platform for distributed real-time analytics, targeting fleets of reference vehicles in the automotive industry. Its users are data analysts. The bulk of the data analytics tasks are performed by clients (on-board), while a central cloud server performs supplementary tasks (off-board). OODIDA can be automatically packaged and deployed, which necessitates restarting parts of the system, or all of it. As this is potentially disruptive, we added the ability to execute user-defined Python modules on clients as well as the server. These modules can be replaced without restarting any part of the system; they can even be replaced between iterations of an ongoing assignment. This feature is referred to as active-code replacement. It facilitates use cases such as iterative A/B testing of machine learning algorithms or modifying experimental algorithms on-the-fly. Consistency of results is achieved by majority vote, which prevents tainted state. Active-code replacement can be done in less than a second in an idealized setting whereas a standard deployment takes many orders of magnitude more time. The main contribution of this paper is the description of a relatively straightforward approach to active-code replacement that is very user-friendly. It enables a data analyst to quickly execute custom code on the cloud server as well as on client devices. Sensible safeguards and design decisions ensure that this feature can be used by non-specialists who are not familiar with the implementation of OODIDA in general or this feature in particular. As a consequence of adding the active-code replacement feature, OODIDA is now very well-suited for rapid prototyping.