LeCo: Lightweight Compression via Learning Serial Correlations
This work addresses performance bottlenecks in data analytics and storage systems by enhancing compression efficiency, though it builds on existing encoding methods in an incremental way.
The paper tackles lightweight data compression for column stores by exploiting serial correlations in value sequences. It reports a Pareto improvement in compression ratio and random access speed over existing solutions, with up to a 5.2x query speedup in Arrow and a 16% throughput increase in RocksDB.
Lightweight data compression is a key technique that allows column stores to exhibit superior performance for analytical queries. Despite comprehensive studies on dictionary-based encodings to approach Shannon's entropy, few prior works have systematically exploited the serial correlation in a column for compression. In this paper, we propose LeCo (i.e., Learned Compression), a framework that uses machine learning to remove the serial redundancy in a value sequence automatically to achieve an outstanding compression ratio and decompression performance simultaneously. LeCo presents a general approach to this end, making existing (ad-hoc) algorithms such as Frame-of-Reference (FOR), Delta Encoding, and Run-Length Encoding (RLE) special cases under our framework. Our microbenchmark with three synthetic and six real-world data sets shows that a prototype of LeCo achieves a Pareto improvement on both compression ratio and random access speed over the existing solutions. When integrating LeCo into widely-used applications, we observe up to a 5.2x speedup on a data analytics query in the Arrow columnar execution engine and a 16% increase in RocksDB's throughput.
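To make the "special case" claim concrete, here is a minimal Python sketch of one way to read it: fit a simple model to the positions of a value sequence and store only the small residuals. This is an illustration under stated assumptions, not LeCo's actual implementation. The names `fit_linear`, `encode`, and `access` are hypothetical, the model is restricted to a line, and residuals are kept in a plain list rather than bit-packed. With a constant model (slope 0, intercept equal to the minimum) the scheme reduces to Frame-of-Reference.

```python
import math

def fit_linear(values):
    """Least-squares line through the points (i, values[i])."""
    n = len(values)
    if n == 1:
        return 0.0, float(values[0])
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def encode(values, model="linear"):
    """Store model parameters plus non-negative residuals (hypothetical layout)."""
    if model == "constant":
        # Assumption: a constant model anchored at the minimum mimics FOR.
        slope, intercept = 0.0, float(min(values))
    else:
        slope, intercept = fit_linear(values)
    preds = [math.floor(slope * i + intercept) for i in range(len(values))]
    residuals = [v - p for v, p in zip(values, preds)]
    bias = min(residuals)                    # shift so residuals are >= 0
    shifted = [r - bias for r in residuals]
    width = max(shifted).bit_length()        # bits needed per residual
    return {"slope": slope, "intercept": intercept,
            "bias": bias, "width": width, "residuals": shifted}

def access(enc, i):
    """Random access: one model evaluation plus one residual lookup."""
    pred = math.floor(enc["slope"] * i + enc["intercept"])
    return pred + enc["bias"] + enc["residuals"][i]

values = [1000, 1003, 1005, 1009, 1012, 1014, 1018, 1021]
for_like = encode(values, model="constant")
learned = encode(values, model="linear")
print("FOR-like bits per value:", for_like["width"])
print("linear-model bits per value:", learned["width"])
```

On a nearly linear sequence like the one above, the linear model leaves narrower residuals than the constant model, and a single value can still be decoded without touching its neighbors, which is the property that supports fast random access.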