Kevin Callahan-Coray

STAT-MECH

3papers

3citations

Novelty62%

AI Score50

Ranked #41,876 of 205,806 authors (top 20%)#2 in STAT-MECH (top 3%)

3 Papers

76.9ETMay 31

Probabilistic Computers for MIMO Detection: From Sparsification to 2D Parallel Tempering

M Mahmudul Hasan Sajeeb, Kevin Callahan-Coray, Corentin Delacour et al.

Probabilistic computers built from p-bits offer a promising path for combinatorial optimization, but the dense connectivity required by real-world problems scales poorly in hardware. Here, we address this through graph sparsification with auxiliary copy variables and demonstrate two fully on-chip parallel tempering solvers on an FPGA. Targeting MIMO detection, a dense, NP-hard problem central to wireless communications, we first fit 11 temperature replicas of a 128-node sparsified system (1,408 p-bits) on-chip and achieve bit error rates significantly below conventional linear detectors on $64 \times 64$ BPSK MIMO. We report complete end-to-end solution times of 3~ms per instance, including all loading, sampling, readout, and verification overheads. ASIC projections in 7~nm technology indicate 103~MHz operation at 285.8~mW, suggesting that massive parallelism across multiple chips could approach the throughput demands of next-generation wireless systems. Sparsification, however, introduces a sharp sensitivity to the copy-constraint strength $P$ that requires manual tuning. To eliminate this bottleneck, we utilize Two-Dimensional Parallel Tempering (2D-PT), which exchanges replicas across both temperature ($β$) and constraint ($P$) dimensions. On Sherrington--Kirkpatrick spin glasses, 2D-PT converges roughly $250\times$ faster than optimally tuned 1D-PT, and on $128 \times 128$ MIMO it reaches zero bit errors at high SNR where 1D-PT exhibits an error floor. We further validate 2D-PT entirely on-chip with 54 replicas (1,728 p-bits) on a $16 \times 16$ MIMO instance, where it tracks the maximum-likelihood bound in just 50 Monte Carlo steps -- $10\times$ fewer than 1D-PT -- at projected 111~MHz and 124~mW in 7~nm. Together, these results establish an on-chip p-bit architecture and a scalable, tuning-free algorithmic framework for dense combinatorial optimization.

24.2STAT-MECHMay 20

Restoring Sparsity in Potts Machines via Mean-Field Constraints

Kevin Callahan-Coray, Kyle Lee, Kyle Jiang et al.

Ising machines and related probabilistic hardware have emerged as promising platforms for NP-hard optimization and sampling. However, many practical problems involve constraints that induce dense or all-to-all couplings, undermining scalability and hardware efficiency. We address this constraint-induced density through two complementary approaches. First, we introduce a hardware-aware native formulation for multi-state probabilistic digits (p-dits) that avoids the locally dense intra-variable couplings required by binary Ising encodings. We validate p-dit dynamics by reproducing known critical behavior of the 2D Potts model. Second, we propose mean-field constraints (MFC), a hybrid scheme that replaces dense pairwise constraint couplings with dynamically updated single-node biases. Applied to balanced graph partitioning, MFC achieves solution quality comparable to exact all-to-all constraint formulations while dramatically reducing graph density. Finally, we demonstrate the practical impact of restored sparsity through an FPGA implementation that accelerates partitioning by more than an order of magnitude over CPU probabilistic solvers and by over two orders of magnitude over a Tabu Ising baseline. Together, these results outline a pathway for scaling constrained optimization on probabilistic hardware.

75.9LGMay 3Code

Stochastic Sparse Attention for Memory-Bound Inference

Kyle Lee, Corentin Delacour, Kevin Callahan-Coray et al.

Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, demonstrating $1.5\times$ decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: https://github.com/OPUSLab/SANTA.git