DS · GN · Mar 16

Hecate: A Modular Genomic Compressor

arXiv:2603.15390 · 13.3 · h-index: 38
Predicted impact: top 65% in DS · last 90 days
Originality: Incremental advance
AI Analysis

This paper addresses efficient storage of and access to genomic data for bioinformatics researchers, offering incremental improvements in speed and compression.

The paper introduces Hecate, a modular lossless framework that treats genomic compression as a conditional coding problem over coupled FASTA/FASTQ streams. Against state-of-the-art tools, it reports the best compression vs. speed trade-offs: 2 to 10 times faster at the same compression ratio, and up to 5% to 10% better compression under the same time budget.

We present Hecate, a modular lossless genomic compression framework. It is designed around uncommon but practical source-coding choices. Unlike many single-method compressors, Hecate treats compression as a conditional coding problem over coupled FASTA/FASTQ streams (control, headers, nucleotides, case, quality, extras). It uses per-stream codecs under a shared indexed block container. Codecs include alphabet-aware packing with an explicit side channel for out-of-alphabet residues, an auxiliary-index Burrows-Wheeler pipeline with custom arithmetic coding, and a blockwise Markov mixture coder with explicit model-competition signaling. This architecture yields high throughput, exact random-access slicing, and referential mode through streamwise binary differencing. In a comprehensive benchmark suite, Hecate provides the best compression vs. speed trade-offs against state-of-the-art established tools (MFCompress, NAF, bzip3, AGC), with notably stronger behaviour on large genomes and high-similarity referential settings. For the same compression ratio, Hecate is 2 to 10 times faster. When given the same time budget as other algorithms, Hecate achieves up to 5% to 10% better compression.
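
To make the idea of "alphabet-aware packing with an explicit side channel for out-of-alphabet residues" concrete, here is a minimal Python sketch. It is not Hecate's actual codec or on-disk format: the names `pack` and `unpack`, the 2-bit layout, and the (position, symbol) exception list are all illustrative assumptions. The sketch also assumes upper-case input, since the abstract describes case as a separate stream.

```python
from typing import List, Tuple

# Hypothetical 2-bit alphabet; the real codec's symbol mapping is not given in the abstract.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> Tuple[bytes, List[Tuple[int, str]], int]:
    """Pack nucleotides at 2 bits each; route everything else to a side channel."""
    packed = bytearray((len(seq) + 3) // 4)  # four bases per byte
    exceptions: List[Tuple[int, str]] = []
    for i, ch in enumerate(seq):
        code = CODE.get(ch)
        if code is None:
            # Out-of-alphabet residue (e.g. N, R, Y): remember it separately
            # and store a placeholder code in the packed stream.
            exceptions.append((i, ch))
            code = 0
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed), exceptions, len(seq)


def unpack(packed: bytes, exceptions: List[Tuple[int, str]], length: int) -> str:
    """Invert pack(): decode 2-bit codes, then patch in the side-channel symbols."""
    out = [BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)]
    for pos, ch in exceptions:
        out[pos] = ch
    return "".join(out)


if __name__ == "__main__":
    seq = "ACGTNNACGTRY"
    packed, exceptions, n = pack(seq)
    assert unpack(packed, exceptions, n) == seq  # lossless round trip
    print(len(seq), "symbols ->", len(packed), "packed bytes +", len(exceptions), "exceptions")
```

The design point this illustrates is that the common case (A, C, G, T) stays at a fixed 2 bits per symbol, while rare ambiguity codes cost extra only where they occur. A production codec would presumably compress the exception list further, for example by run-length encoding long stretches of N.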

