Kyle Jiang

24.2STAT-MECHMay 20

Restoring Sparsity in Potts Machines via Mean-Field Constraints

Kevin Callahan-Coray, Kyle Lee, Kyle Jiang et al.

Ising machines and related probabilistic hardware have emerged as promising platforms for NP-hard optimization and sampling. However, many practical problems involve constraints that induce dense or all-to-all couplings, undermining scalability and hardware efficiency. We address this constraint-induced density through two complementary approaches. First, we introduce a hardware-aware native formulation for multi-state probabilistic digits (p-dits) that avoids the locally dense intra-variable couplings required by binary Ising encodings. We validate p-dit dynamics by reproducing known critical behavior of the 2D Potts model. Second, we propose mean-field constraints (MFC), a hybrid scheme that replaces dense pairwise constraint couplings with dynamically updated single-node biases. Applied to balanced graph partitioning, MFC achieves solution quality comparable to exact all-to-all constraint formulations while dramatically reducing graph density. Finally, we demonstrate the practical impact of restored sparsity through an FPGA implementation that accelerates partitioning by more than an order of magnitude over CPU probabilistic solvers and by over two orders of magnitude over a Tabu Ising baseline. Together, these results outline a pathway for scaling constrained optimization on probabilistic hardware.

75.9LGMay 3Code

Stochastic Sparse Attention for Memory-Bound Inference

Kyle Lee, Corentin Delacour, Kevin Callahan-Coray et al.

Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, demonstrating $1.5\times$ decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: https://github.com/OPUSLab/SANTA.git

Kyle Jiang

2 Papers