Weighted Reservoir Sampling With Replacement from Data Streams
This addresses a gap in streaming data sampling for applications requiring weighted representation, though it is incremental as it extends existing methods to sampling with replacement.
The paper tackles the problem of weighted random sampling with replacement from data streams, where element inclusion probability is proportional to weight, and presents a new one-pass algorithm that is proven correct and efficient, with experimental analysis showing its performance compared to state-of-the-art methods.
In this work, we present a new random sampling method for data streams where the probability of an element's inclusion in the sample is proportional to a weight associated with that element. Our method is based on sampling with replacement, although most of the literature on this topic has focused on sampling without replacement. Our algorithm generates a weighted random sample in one pass over a population of unknown size. At any point in time, the sample is representative of the population seen so far and can be directly used by other modules without requiring any post-processing. We formally prove the correctness and efficiency of our method. An experimental analysis shows the performance of our method in practice when compared to state-of-the-art methods.