StreamSampling.jl: Efficient Sampling from Data Streams in Julia
This work provides a practical tool for data scientists and engineers working with streaming data, although its contribution is incremental because it builds on existing sampling methods.
The authors address the problem of sampling from data streams of unknown size by developing StreamSampling.jl, a Julia library for efficient single-pass sampling with constant memory usage. The library avoids materializing the full stream in memory, and the reported benchmarks show performance and memory improvements over standard approaches.
StreamSampling.jl is a Julia library designed to provide general and efficient methods for sampling from data streams in a single pass, even when the total number of items is unknown. In this paper, we describe the capabilities of the library and its advantages over traditional sampling procedures, such as maintaining a small, constant memory footprint and avoiding the need to fully materialize the stream in memory. Furthermore, we provide empirical benchmarks comparing online sampling methods against standard approaches, demonstrating performance and memory improvements.
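To make the single-pass, constant-memory idea concrete, the following is a minimal sketch of classic reservoir sampling (often called Algorithm R), written in Python purely for illustration. It is not the StreamSampling.jl API and not necessarily the algorithm the library uses; the names `reservoir_sample` and `numbers` are hypothetical and introduced only for this example.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Draw up to k items uniformly, without replacement, in one pass.

    Hypothetical helper for illustration only (classic Algorithm R).
    Memory use is O(k) regardless of how long the stream is.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def numbers(limit: int) -> Iterator[int]:
    """A lazy stream whose length the sampler never inspects (hypothetical)."""
    n = 0
    while n < limit:
        yield n
        n += 1


# The generator is consumed once and never materialized as a list.
sample = reservoir_sample(numbers(100_000), 5, random.Random(42))
short = reservoir_sample(numbers(3), 5, random.Random(42))
```

The sampler only ever holds `k` items, which is the property the abstract describes as a small, constant memory footprint. When the stream has fewer than `k` items, this sketch simply returns all of them.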