Tathagata Srimani

ET
3papers
3citations
Novelty60%
AI Score50

3 Papers

24.6ETMay 31
Probabilistic Computers for MIMO Detection: From Sparsification to 2D Parallel Tempering

M Mahmudul Hasan Sajeeb, Kevin Callahan-Coray, Corentin Delacour et al.

Probabilistic computers built from p-bits offer a promising path for combinatorial optimization, but the dense connectivity required by real-world problems scales poorly in hardware. Here, we address this through graph sparsification with auxiliary copy variables and demonstrate two fully on-chip parallel tempering solvers on an FPGA. Targeting MIMO detection, a dense, NP-hard problem central to wireless communications, we first fit 11 temperature replicas of a 128-node sparsified system (1,408 p-bits) on-chip and achieve bit error rates significantly below conventional linear detectors on $64 \times 64$ BPSK MIMO. We report complete end-to-end solution times of 3~ms per instance, including all loading, sampling, readout, and verification overheads. ASIC projections in 7~nm technology indicate 103~MHz operation at 285.8~mW, suggesting that massive parallelism across multiple chips could approach the throughput demands of next-generation wireless systems. Sparsification, however, introduces a sharp sensitivity to the copy-constraint strength $P$ that requires manual tuning. To eliminate this bottleneck, we utilize Two-Dimensional Parallel Tempering (2D-PT), which exchanges replicas across both temperature ($β$) and constraint ($P$) dimensions. On Sherrington--Kirkpatrick spin glasses, 2D-PT converges roughly $250\times$ faster than optimally tuned 1D-PT, and on $128 \times 128$ MIMO it reaches zero bit errors at high SNR where 1D-PT exhibits an error floor. We further validate 2D-PT entirely on-chip with 54 replicas (1,728 p-bits) on a $16 \times 16$ MIMO instance, where it tracks the maximum-likelihood bound in just 50 Monte Carlo steps -- $10\times$ fewer than 1D-PT -- at projected 111~MHz and 124~mW in 7~nm. Together, these results establish an on-chip p-bit architecture and a scalable, tuning-free algorithmic framework for dense combinatorial optimization.

55.8ETMar 18
Probabilistic approximate optimization using single-photon avalanche diode arrays

Ziyad Alswaidan, Abdelrahman S. Abdelrahman, Md Sakibur Sajal et al.

Combinatorial optimization problems are central to science and engineering and specialized hardware from quantum annealers to classical Ising machines are being actively developed to address them. These systems typically sample from a fixed energy landscape defined by the problem Hamiltonian encoding the discrete optimization problem. The recently introduced Probabilistic Approximate Optimization Algorithm (PAOA) takes a different approach: it treats the optimization landscape itself as variational, iteratively learning circuit parameters from samples. Here, we demonstrate PAOA on a 64$\times$64 perimeter-gated single-photon avalanche diode (pgSPAD) array fabricated in 0.35 $μ$m CMOS, the first realization of the algorithm using intrinsically stochastic nanodevices. Each p-bit exhibits a device-specific, asymmetric (Gompertz-type) activation function due to dark-count variability. Rather than calibrating devices to enforce a uniform symmetric (logistic/tanh) activation, PAOA learns around device variations, absorbing residual activation and other mismatches into the variational parameters. On canonical 26-spin Sherrington-Kirkpatrick instances, PAOA achieves high approximation ratios with $2p$ parameters ($p$ up to 17 layers), and pgSPAD-based inference closely tracks CPU simulations. These results show that variational learning can accommodate the non-idealities inherent to nanoscale devices, suggesting a practical path toward larger-scale, CMOS-compatible probabilistic computers.

77.8LGMay 3Code
Stochastic Sparse Attention for Memory-Bound Inference

Kyle Lee, Corentin Delacour, Kevin Callahan-Coray et al.

Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, demonstrating $1.5\times$ decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: https://github.com/OPUSLab/SANTA.git