Should I Hide My Duck in the Lake?
This work addresses the I/O bottleneck in cloud data lakes for query engines like DuckDB, but the proposal is still at the vision stage and is backed only by experimental estimates.
Data lakes spend significant time scanning remote data, with decoding alone accounting for 46% of TPC-H runtime on Parquet files. The authors propose a SmartNIC that offloads decoding and pushed-down operators, and estimate that a setup with smaller, less expensive CPUs can then match the query throughput of traditional setups.
Data lakes spend a significant fraction of query execution time on scanning data from remote, disaggregated storage. Decoding alone accounts for 46% of runtime when running TPC-H directly on Parquet files. To address this bottleneck, we propose a vision for a data processing SmartNIC for the cloud that sits on the network datapath of compute nodes to offload decoding and pushed-down operators, effectively hiding the cost of parsing raw files. Our experimental estimations with DuckDB suggest that by operating directly on pre-filtered data, as delivered by a SmartNIC, we can significantly increase query processing performance and can still match query throughput of traditional setups with smaller, less expensive CPUs.
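To put the 46% figure in perspective, here is a back-of-the-envelope sketch using Amdahl's law. It is not the authors' estimation method. It assumes that decoding is removed from the CPU's critical path entirely and that nothing else changes. The function name `max_speedup` is hypothetical and chosen only for illustration.

```python
def max_speedup(offloaded_fraction: float) -> float:
    """Amdahl-style bound: speedup when a fraction of runtime is removed entirely.

    Assumption (not from the paper): the offloaded work costs the host CPU
    nothing and the remaining work is unchanged.
    """
    if not 0.0 <= offloaded_fraction < 1.0:
        raise ValueError("offloaded_fraction must be in [0, 1)")
    return 1.0 / (1.0 - offloaded_fraction)


# 0.46 is the decoding share of TPC-H runtime on Parquet files quoted in the abstract.
decode_fraction = 0.46
bound = max_speedup(decode_fraction)
print(f"Speedup bound from hiding decoding alone: {bound:.2f}x")
```

Under these assumptions, hiding decoding alone gives at most about 1.85x. Pushed-down operators that pre-filter the data could reduce the remaining host-side work further, which this simple bound does not capture.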