DC · Apr 22

Characterizing and Fixing Silent Data Loss in Spark-on-AWS-Lambda with Open Table Formats

arXiv:2604.20081
AI Analysis

The paper addresses a critical reliability issue for users of serverless big data processing with open table formats and offers a practical fix.

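The paper characterizes silent data loss in Spark-on-AWS-Lambda when a job is killed between the data upload and metadata commit phases of a write. In controlled kill-injection experiments, every kill landing in that gap caused silent data loss for both Delta Lake and Apache Iceberg. It also presents SafeWriter, which converted every tested kill scenario into a clean, detectable rollback while adding under 100 ms to normal write paths.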

AWS Lambda terminates containers with an uncatchable SIGKILL signal when a function exceeds its configured timeout. When a Spark-on-AWS-Lambda (SoAL) job is killed between Phase 1 (data upload) and Phase 2 (metadata commit) of a write, the result is silent data loss: orphaned Parquet files accumulate on S3 while the table's committed state remains unchanged and standard monitoring raises no alert. We characterize this vulnerability across Delta Lake and Apache Iceberg through 860 controlled kill-injection experiments at three dataset sizes. A SIGKILL landing in the inter-phase gap produced silent data loss in 100% of trials for both formats. We then present SafeWriter, a language-level wrapper that arms a watchdog thread 30 seconds before the Lambda timeout, triggers a format-native rollback via SQL, and records a checkpoint document on S3. SafeWriter converted every tested kill scenario into a clean, detectable rollback with under 100 ms added to normal write paths.
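The watchdog idea in the abstract can be illustrated with a minimal Python sketch. This is not SafeWriter's actual code: the names `DeadlineWatchdog`, `guarded_write`, `rollback`, and `record_checkpoint` are hypothetical, and the rollback and checkpoint steps are passed in as plain callbacks rather than real SQL or S3 calls. The sketch only shows the timing logic: arm a timer that fires a fixed margin (30 seconds by default) before the timeout, and cancel it if the write finishes first.

```python
import threading
from typing import Callable, Dict


class DeadlineWatchdog:
    """Hypothetical helper: calls `on_deadline` once if the guarded block is
    still running `margin_s` seconds before the function timeout."""

    def __init__(self, remaining_ms: int, on_deadline: Callable[[], None],
                 margin_s: float = 30.0) -> None:
        # Time to wait before firing; clamp to zero if the margin is already gone.
        delay_s = max(0.0, remaining_ms / 1000.0 - margin_s)
        self.fired = threading.Event()
        self._on_deadline = on_deadline
        self._timer = threading.Timer(delay_s, self._fire)
        self._timer.daemon = True  # do not keep the process alive for the timer

    def _fire(self) -> None:
        try:
            self._on_deadline()
        finally:
            # Set only after the callback has run, so waiters see its effects.
            self.fired.set()

    def __enter__(self) -> "DeadlineWatchdog":
        self._timer.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # The write finished (or raised) in time: disarm the watchdog.
        self._timer.cancel()
        return False


def guarded_write(remaining_ms: int,
                  write: Callable[[], None],
                  rollback: Callable[[], None],
                  record_checkpoint: Callable[[Dict[str, str]], None],
                  margin_s: float = 30.0) -> bool:
    """Run `write` under a watchdog; return True if the watchdog never fired."""

    def on_deadline() -> None:
        # Assumption: `rollback` stands in for the format-native rollback and
        # `record_checkpoint` for writing the checkpoint document.
        rollback()
        record_checkpoint({"status": "rolled_back"})

    with DeadlineWatchdog(remaining_ms, on_deadline, margin_s) as watchdog:
        write()
    return not watchdog.fired.is_set()
```

In a Lambda handler, `remaining_ms` could come from `context.get_remaining_time_in_millis()`. The sketch does not coordinate the rollback with a write that is still running or that commits at the same instant the timer fires. A real implementation would need to handle both cases.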
