DB IR LG | Oct 17, 2024

Lightweight Correlation-Aware Table Compression

Mihail Stoian, Alexander van Renen, Jan Kobiolka, Ping-Lin Kuo, Josif Grabocka, Andreas Kipf

arXiv:2410.14066v31.23 citations | h-index: 21 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses storage efficiency for users of data lakes managing relational data, representing an incremental improvement over existing correlation-aware methods.

The paper tackles inefficient compression in data lake storage formats by introducing a framework that automatically leverages data correlations, reducing file sizes by up to 40% compared to Apache Parquet while adding minimal scan overhead.

The growing adoption of data lakes for managing relational data necessitates efficient, open storage formats that provide high scan performance and competitive compression ratios. While existing formats achieve fast scans through lightweight encoding techniques, they have reached a plateau in terms of minimizing storage footprint. Recently, correlation-aware compression schemes have been shown to reduce file sizes further. Yet, current approaches either incur significant scan overheads or require manual specification of correlations, limiting their practicability. We present $\texttt{Virtual}$, a framework that integrates seamlessly with existing open formats to automatically leverage data correlations, achieving substantial compression gains while having minimal scan performance overhead. Experiments on data-gov datasets show that $\texttt{Virtual}$ reduces file sizes by up to 40% compared to Apache Parquet.
