Data Twinning
This work addresses the need for efficient data splitting in big data contexts, although the contribution is incremental because it builds directly on the existing SPlit method.
The authors tackled the problem of partitioning datasets into statistically similar subsets by developing Twinning, a method that is orders of magnitude faster than the existing SPlit algorithm, enabling applications in big data compression and cross-validation.
In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide-and-conquer procedures and $k$-fold cross-validation.
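To make the idea of "statistically similar twin sets" concrete, the following is a minimal sketch in Python under stated assumptions. It is not the authors' algorithm or implementation. The function `greedy_twin_split` is a hypothetical nearest-neighbor heuristic written for illustration, the splitting ratio `r` is an assumed parameter, and the energy distance is used here only as one plausible way to quantify how similar the two subsets are. The naive distance computations are quadratic in the number of rows, so the sketch says nothing about the speed claimed in the abstract.

```python
import numpy as np


def energy_distance(a, b):
    """Sample energy distance between two point sets (rows are observations)."""
    def mean_pairwise(x, y):
        # Mean Euclidean distance over all pairs (one row from x, one from y).
        diff = x[:, None, :] - y[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()

    return 2.0 * mean_pairwise(a, b) - mean_pairwise(a, a) - mean_pairwise(b, b)


def greedy_twin_split(data, r, seed=0):
    """Hypothetical greedy split: roughly 1/r of the rows go to the first twin.

    Repeatedly take the current point and its nearest remaining neighbors
    (r points in total), keep one of them for the first twin, and leave the
    others for the second twin.
    """
    rng = np.random.default_rng(seed)
    remaining = np.arange(len(data))
    current = int(rng.integers(len(data)))  # random starting row
    twin = []
    while remaining.size > 0:
        dists = np.linalg.norm(data[remaining] - data[current], axis=1)
        group = remaining[np.argsort(dists)[:r]]  # closest r remaining rows
        twin.append(group[0])                     # closest row joins the first twin
        farthest = group[-1]
        remaining = np.setdiff1d(remaining, group)
        if remaining.size == 0:
            break
        # Continue from the remaining row nearest to the farthest group member.
        d_far = np.linalg.norm(data[remaining] - data[farthest], axis=1)
        current = remaining[np.argmin(d_far)]
    twin = np.array(twin)
    rest = np.setdiff1d(np.arange(len(data)), twin)
    return twin, rest


# Usage: split 200 synthetic points with ratio r = 5 and measure similarity.
points = np.random.default_rng(1).normal(size=(200, 2))
twin_idx, rest_idx = greedy_twin_split(points, r=5)
similarity_gap = energy_distance(points[twin_idx], points[rest_idx])
```

A smaller `similarity_gap` indicates that the two subsets are closer in distribution. Applying such a split repeatedly to the larger subset would be one way to produce several similar folds, which is the kind of use the abstract mentions for $k$-fold cross-validation.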