DB · May 5

Should I Hide My Duck in the Lake?

arXiv:2602.18775 · 61.2 · h-index: 7
AI Analysis

This work addresses the I/O bottleneck in cloud data lakes for query engines like DuckDB, but the proposal is at a vision stage with only experimental estimations.

Query engines on data lakes spend a significant share of execution time scanning remote data; decoding alone accounts for 46% of runtime when running TPC-H directly on Parquet files. The authors propose a SmartNIC on the compute node's network datapath to offload decoding and pushed-down operators, and estimate that it can match the query throughput of traditional setups while using smaller, less expensive CPUs.

Data lakes spend a significant fraction of query execution time on scanning data from remote, disaggregated storage. Decoding alone accounts for 46% of runtime when running TPC-H directly on Parquet files. To address this bottleneck, we propose a vision for a data processing SmartNIC for the cloud that sits on the network datapath of compute nodes to offload decoding and pushed-down operators, effectively hiding the cost of parsing raw files. Our experimental estimations with DuckDB suggest that by operating directly on pre-filtered data, as delivered by a SmartNIC, we can significantly increase query processing performance and can still match query throughput of traditional setups with smaller, less expensive CPUs.
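To put the 46% figure in perspective, here is a back-of-the-envelope sketch using Amdahl's law. It is not the authors' estimation method. It assumes, optimistically, that a SmartNIC hides the entire decoding cost and leaves the rest of the query untouched. The function name `max_speedup` and the constant `DECODE_FRACTION` are illustrative names, not taken from the paper.

```python
def max_speedup(offloaded_fraction: float) -> float:
    """Upper bound on speedup when a fraction of runtime is removed entirely.

    This is Amdahl's law in the limit where the offloaded part costs nothing.
    """
    if not 0.0 <= offloaded_fraction < 1.0:
        raise ValueError("offloaded_fraction must be in [0, 1)")
    return 1.0 / (1.0 - offloaded_fraction)


# Share of TPC-H runtime spent decoding Parquet, as stated in the abstract.
DECODE_FRACTION = 0.46

# Assumption: decoding is fully hidden and nothing else changes.
speedup = max_speedup(DECODE_FRACTION)
print(f"Upper-bound speedup: {speedup:.2f}x")  # prints "Upper-bound speedup: 1.85x"
```

Under these assumptions the CPU has roughly 1.85 times the headroom for the same workload, which is consistent in spirit with the claim that a smaller CPU could match the throughput of a traditional setup. Any real gain would depend on how much of the decoding and filtering the SmartNIC actually absorbs.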
