PF Jan 23, 2023
AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems
Yuan Feng, Hyeran Jeon, Filip Blagojevic et al.
Transformer models have gained popularity because of their superior inference accuracy and inference throughput. However, the transformer is computation-intensive, causing a long inference time. Existing works on transformer inference acceleration are limited by either the modification of transformer architectures or the need for specialized hardware. In this paper, we identify opportunities to use memoization to accelerate the self-attention mechanism in transformers without the above limitations. Built upon a unique observation that there is rich similarity in attention computation across inference sequences, we build a memoization database that leverages the emerging big memory system. We introduce a novel embedding technique to find semantically similar inputs to identify computation similarity. We also introduce a series of techniques such as memory mapping and selective memoization to avoid memory copies and unnecessary overhead. We enable 22% inference-latency reduction on average (up to 68%) with negligible loss in inference accuracy.
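The memoization idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `AttentionMemo` class, the mean-pooled `embed` function (a stand-in for the paper's embedding technique, which the abstract does not describe), the cosine-similarity threshold of 0.99, and the use of the raw token vectors as both queries and keys.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_probs(queries, keys):
    """Row-wise softmax(Q K^T / sqrt(d)): the computation memoization tries to skip."""
    scale = math.sqrt(len(queries[0]))
    return [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
            for q in queries]

def embed(tokens):
    """Stand-in embedding (assumption): L2-normalised mean of the token vectors."""
    dim = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0
    return [v / norm for v in mean]

class AttentionMemo:
    """Hypothetical in-memory table mapping input embeddings to attention probabilities."""

    def __init__(self, threshold=0.99):
        self.threshold = threshold  # illustrative cosine-similarity cut-off
        self.entries = []           # (embedding, sequence length, attention probabilities)
        self.hits = 0
        self.misses = 0

    def lookup_or_compute(self, tokens):
        key = embed(tokens)
        best_sim, best_probs = -1.0, None
        for stored_key, length, probs in self.entries:
            sim = sum(a * b for a, b in zip(key, stored_key))
            if length == len(tokens) and sim > best_sim:
                best_sim, best_probs = sim, probs
        if best_probs is not None and best_sim >= self.threshold:
            self.hits += 1
            return best_probs       # reuse: no QK^T, no softmax
        self.misses += 1
        probs = attention_probs(tokens, tokens)  # Q = K = tokens, projections omitted
        self.entries.append((key, len(tokens), probs))
        return probs

memo = AttentionMemo()
seq_a = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
seq_b = [[1.0, 0.01, 2.0], [0.5, 1.0, 0.01]]  # near-duplicate of seq_a
seq_c = [[-1.0, 3.0, 0.0], [0.0, 2.0, -2.0]]  # unrelated input
first = memo.lookup_or_compute(seq_a)
second = memo.lookup_or_compute(seq_b)
third = memo.lookup_or_compute(seq_c)
```

The linear scan over `entries` is only for readability; a database of realistic size would presumably need an indexed nearest-neighbour search.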
57.5 IT May 20
Reed-Muller Codes for Joint Random and Stuck-At Error Correction
Ivana Djurdjevic, Robert Mateescu, Cyril Guyot
Block codes are considered for improving the reliability of messages stored in a computer memory with both stuck-at defects and random errors. It is assumed that the side information about the state of the defects is available to the encoder, but not to the decoder. A novel recursive construction of a set of masks is developed such that it can satisfy any $s$ stuck-at errors in a binary sequence of length $2^m$, when $s \leq m$. We prove that the masks generated in this way are codewords in a Reed-Muller $RM(s-1, m)$ code. The constructed set contains no more than $2^s m^{s-1}$ masks. We provide lower and upper bounds on the size of the stuck-at redundancy, a fixed subset of mask bits that uniquely represents each mask in the set. The stuck-at code constructed in this way is a non-linear code. It is also a subcode of an $RM(r,m)$ code, with $r \geq s-1$, that can be used for additional random error correction. The encoding requires no mask search and is straightforward based on the description of the recursive construction. The decoding is done in a single attempt and requires almost no additional complexity or latency.
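As a rough illustration of what "satisfying" stuck-at cells with a mask means, the sketch below brute-forces over all codewords of $RM(1, 3)$ (the case $s = 2$, $m = 3$) and checks that, for every pair of stuck cells and every pair of stuck values, some mask makes the written word agree with the defects. This is not the recursive construction from the paper, which uses a smaller mask set and needs no search; the function names `rm1_codewords` and `find_mask` and the example data word are assumptions made for this sketch.

```python
from itertools import combinations, product

def rm1_codewords(m):
    """All codewords of RM(1, m): evaluations of a0 + a1*x1 + ... + am*xm over GF(2)."""
    points = list(product([0, 1], repeat=m))
    words = []
    for coeffs in product([0, 1], repeat=m + 1):
        a0, rest = coeffs[0], coeffs[1:]
        words.append(tuple((a0 + sum(a * x for a, x in zip(rest, p))) % 2
                           for p in points))
    return words

def find_mask(data, stuck, masks):
    """Return a mask such that (data XOR mask) equals the stuck value at every stuck cell."""
    for mask in masks:
        if all(data[i] ^ mask[i] == value for i, value in stuck.items()):
            return mask
    return None

m, s = 3, 2
masks = rm1_codewords(m)             # RM(s-1, m) with s = 2
data = (1, 0, 1, 1, 0, 0, 1, 0)      # arbitrary 2^m-bit word to store

# Exhaustively check every choice of s stuck cells and stuck values.
all_patterns_satisfied = all(
    find_mask(data, dict(zip(cells, values)), masks) is not None
    for cells in combinations(range(2 ** m), s)
    for values in product([0, 1], repeat=s)
)

# One concrete case: cell 0 is stuck at 0 and cell 5 is stuck at 1.
stuck = {0: 0, 5: 1}
mask = find_mask(data, stuck, masks)
written = tuple(d ^ b for d, b in zip(data, mask))
```

The search here runs over all 16 codewords of $RM(1, 3)$; for the same parameters the bound $2^s m^{s-1}$ quoted in the abstract evaluates to $2^2 \cdot 3 = 12$.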
CL Oct 29, 2025
AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
Dinghong Song, Yuan Feng, Yiwei Wang et al.
Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many real-world workloads, such as classification, question answering, recommendation, and text embedding, rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill-only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
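The reuse step can be sketched as follows: once an attention map is available from a cache, the layer output is just that map applied to the value vectors, so the $QK^T$ product and the softmax are skipped. The sketch below is a toy under stated assumptions (token vectors serve directly as queries, keys, and values; the cached map comes from a hand-made near-duplicate input; the function names are hypothetical) and it omits the similarity search that would select the cached map.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Row-wise softmax(Q K^T / sqrt(d)); cost grows quadratically with sequence length."""
    scale = math.sqrt(len(queries[0]))
    return [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
            for q in queries]

def apply_attention(attn, values):
    """Weighted sum of value vectors; the only step still needed when a map is reused."""
    dim = len(values[0])
    return [[sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
            for weights in attn]

# Map computed earlier for one input and kept in a cache (illustrative data).
cached_input = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
cached_map = attention_map(cached_input, cached_input)

# A new input whose attention map is assumed to be close to the cached one.
new_input = [[1.0, 0.01, 2.0], [0.5, 1.0, 0.01]]
exact_output = apply_attention(attention_map(new_input, new_input), new_input)
reused_output = apply_attention(cached_map, new_input)  # skips QK^T and softmax

max_abs_error = max(abs(a - b)
                    for row_a, row_b in zip(exact_output, reused_output)
                    for a, b in zip(row_a, row_b))
```

How large `max_abs_error` may be before accuracy suffers depends on the model and task; the toy inputs here are chosen so that the two maps nearly coincide.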
CR Sep 21, 2020
On the Efficient Estimation of Min-Entropy
Yongjune Kim, Cyril Guyot, Young-Sik Kim
The min-entropy is a widely used metric to quantify the randomness of generated random numbers in cryptographic applications; it measures the difficulty of guessing the most likely output. An important min-entropy estimator is the compression estimator of NIST Special Publication (SP) 800-90B, which relies on Maurer's universal test. In this paper, we propose two kinds of min-entropy estimators to improve computational complexity and estimation accuracy by leveraging two variations of Maurer's test: Coron's test (for Shannon entropy) and Kim's test (for Rényi entropy). First, we propose a min-entropy estimator based on Coron's test. It is computationally more efficient than the compression estimator while maintaining the estimation accuracy. The second proposed estimator relies on Kim's test, which computes the Rényi entropy. This estimator improves both estimation accuracy and computational complexity. We analytically characterize the bias-variance tradeoff, which depends on the order of the Rényi entropy. Taking this tradeoff into account, we observe that order two is a proper assignment and focus on min-entropy estimation based on the collision entropy (i.e., Rényi entropy of order two). The min-entropy estimation from the collision entropy can be described by a closed-form solution, whereas both the compression estimator and the proposed estimator based on Coron's test do not have closed-form solutions. By leveraging the closed-form solution, we also propose a lightweight estimator that processes data samples in an online manner. Numerical evaluations demonstrate that the first proposed estimator achieves the same accuracy as the compression estimator with much less computation. The proposed estimator based on the collision entropy can even improve the accuracy and reduce the computational complexity.
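To make the link between collision entropy and min-entropy concrete, here is a minimal sketch for a binary source only. For a binary source with most-likely-symbol probability $\theta$, the collision probability is $p_c = \theta^2 + (1-\theta)^2$, which inverts to $\theta = (1 + \sqrt{2p_c - 1})/2$. The sketch estimates $p_c$ by direct pair counting rather than by Kim's test, and the binary restriction and function names are assumptions for illustration; the paper's own closed-form expression is not reproduced here.

```python
import math
from collections import Counter

def min_entropy(probs):
    """H_min = -log2(max p): difficulty of guessing the most likely output."""
    return -math.log2(max(probs))

def collision_entropy(probs):
    """Renyi entropy of order two: H_2 = -log2(sum p^2)."""
    return -math.log2(sum(p * p for p in probs))

def collision_probability(samples):
    """Unbiased estimate of sum p^2: the fraction of sample pairs that are equal."""
    n = len(samples)
    counts = Counter(samples)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def binary_min_entropy_from_collision(pc):
    """Invert pc = theta^2 + (1 - theta)^2 for a binary source, then take -log2(theta)."""
    pc = min(max(pc, 0.5), 1.0)  # clamp estimates that fall outside the valid range
    theta = (1.0 + math.sqrt(2.0 * pc - 1.0)) / 2.0
    return -math.log2(theta)

# With exact probabilities, the inversion recovers the true min-entropy.
probs = [0.75, 0.25]
exact = min_entropy(probs)
recovered = binary_min_entropy_from_collision(sum(p * p for p in probs))

# With samples, the collision probability is estimated first.
bits = [1, 1, 1, 0]
estimated = binary_min_entropy_from_collision(collision_probability(bits))
```

Because the estimate depends on the data only through pair counts, it could be updated incrementally as samples arrive, which is in the spirit of the online estimator mentioned in the abstract.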