AR CV LG · Nov 17, 2025

QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention

Hyunwoo Oh, Hanning Chen, Sanggeon Yun, Yang Ni, Wenjun Huang, Tamoghno Das, Suyeon Jang, Mohsen Imani

arXiv:2511.13679v1 · 1.2 · h-index: 9

Originality: Highly original

AI Analysis

This work addresses the hardware performance bottlenecks of deformable-transformer detection models in computer vision, offering substantial acceleration over both a GPU baseline and prior accelerators.

The paper tackles the problem of inefficient hardware mapping for deformable transformers due to irregular memory access and low arithmetic intensity by introducing QUILL, an algorithm-architecture co-design that achieves up to 7.29x higher throughput and 47.3x better energy efficiency compared to an RTX 4090.

Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer, forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, Softmax, aggregation, and the final projection (W''_m) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy stays within 0.9 AP of FP32 across Deformable and Sparse DETR variants. By converting sparsity into locality, and locality into utilization, QUILL delivers consistent, end-to-end speedups.
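To make the idea of ordering queries by spatial proximity concrete, here is a minimal software sketch. It is not the paper's DOOQ hardware scheduler: the greedy nearest-neighbour rule, the function names `greedy_proximity_order` and `path_length`, and the sample reference points are all assumptions chosen for illustration. It only shows how visiting spatially close queries back to back shortens the distance travelled across the feature map, which is the property a region prefetcher can exploit.

```python
import math


def greedy_proximity_order(points, start=0):
    """Return query indices ordered so each query is followed by its
    nearest not-yet-visited neighbour (hypothetical stand-in for DOOQ)."""
    if not points:
        return []
    remaining = set(range(len(points)))
    order = [start]
    remaining.remove(start)
    while remaining:
        current = points[order[-1]]
        # Ties are broken by index so the result is deterministic.
        nxt = min(remaining, key=lambda i: (math.dist(points[i], current), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def path_length(points, order):
    """Total Euclidean distance travelled when visiting points in `order`."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))


# Illustrative normalized reference points (x, y) for five queries.
points = [(0.1, 0.1), (0.9, 0.9), (0.15, 0.12), (0.8, 0.85), (0.12, 0.2)]
order = greedy_proximity_order(points)
original = list(range(len(points)))

print("reordered:", order)
print("travel (original):", round(path_length(points, original), 3))
print("travel (reordered):", round(path_length(points, order), 3))
```

In this toy input the original order jumps between two corners of the feature map on every step, while the reordered sequence finishes one cluster before moving to the other. A hardware scheduler would likely bound the search to a small look-ahead window rather than scan all remaining queries, which this sketch does not model.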
