Xinmiao Zhang

h-index 13

9 papers

54 citations

Novelty 51%

AI Score 50

Ranked #43,710 of 201,326 authors (top 22%) · #842 in CR (top 12%)

9 Papers

AI · Feb 19, 2023

Language-Specific Representation of Emotion-Concept Knowledge Causally Supports Emotion Inference

Ming Li, Yusheng Su, Hsiu-Yuan Huang et al. · tsinghua

Humans no doubt use language to communicate about their emotional experiences, but does language in turn help humans understand emotions, or is language just a vehicle of communication? This study used a form of artificial intelligence (AI) known as large language models (LLMs) to assess whether language-based representations of emotion causally contribute to the AI's ability to generate inferences about the emotional meaning of novel situations. Fourteen attributes of human emotion concept representation were found to be represented by distinct artificial neuron populations in the LLM. By manipulating these attribute-related neurons, we in turn demonstrated the role of emotion concept knowledge in generative emotion inference. The attribute-specific performance deterioration was related to the importance of different attributes in human mental space. Our findings provide a proof of concept that even an LLM can learn about emotions in the absence of sensory-motor representations and highlight the contribution of language-derived emotion-concept knowledge to emotion inference.

CL · Dec 3, 2025

Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers

Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang et al.

Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigating the KV cache bottleneck, it typically underperforms within-layer methods such as GQA. To understand the root cause, we investigate the information flow of keys and values in the top layers. Our preliminary analysis reveals a clear distribution: values are predominantly derived from the bottom layer, while keys draw more information from both the bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach in which top-layer KV caches are derived directly from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed method reduces cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing it as a memory-efficient, high-performance architectural alternative.

23.0 · CR · May 17

Triple-Hoisted Baby-Step Giant-Step Linear Transformation over CKKS Homomorphic Encryption and Hardware Accelerator

Sajjad Akherati, Xinmiao Zhang

Computations can be directly carried out over ciphertexts using homomorphic encryption (HE), which is indispensable for privacy-preserving cloud computing. Linear transformation is widely used in neural networks, including large language models. However, the implementation of linear transformation over HE requires a large number of ciphertext rotations, which incur significant memory and hardware overhead despite existing simplification techniques. This paper proposes a triple-hoisted baby-step giant-step algorithm that decomposes the baby step further to substantially reduce the number of ciphertext rotations needed for the CKKS HE evaluation of linear transformation. Moreover, to reduce off-chip memory access, which contributes to the majority of the latency, a memory-optimized data path is proposed by partitioning the algorithm into multiple phases. Furthermore, an efficient FPGA-based hardware accelerator with an optimized permutation circuit for message routing is designed for the proposed scheme. For a set of typical parameters, the proposed design reduces the off-chip memory access by 2.9x compared to the best prior design. Synthesized for Xilinx Virtex UltraScale+ devices, the proposed design achieves a 5.8x reduction in computational latency compared with the baseline design.
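To make the rotation count concrete, here is a minimal plaintext sketch of the standard (single-level) baby-step giant-step diagonal method for a matrix-vector product, which the abstract above builds on. It is not the triple-hoisted algorithm of the paper. Lists stand in for CKKS ciphertext slots, and the function names (`rot`, `diagonal`, `diagonal_matvec`, `bsgs_matvec`) and the baby-step size `b` are illustrative assumptions.

```python
def rot(v, k):
    """Cyclic left rotation: result[i] = v[(i + k) % n]."""
    n = len(v)
    k %= n
    return v[k:] + v[:k]


def diagonal(M, k):
    """k-th generalized diagonal of M: d[i] = M[i][(i + k) % n]."""
    n = len(M)
    return [M[i][(i + k) % n] for i in range(n)]


def diagonal_matvec(M, x):
    """Plain diagonal method: one rotation of x per nonzero shift."""
    n = len(x)
    y = [0] * n
    rotations = 0
    for k in range(n):
        xr = x
        if k:
            xr = rot(x, k)      # stands in for a ciphertext rotation
            rotations += 1
        d = diagonal(M, k)
        y = [acc + p * c for acc, p, c in zip(y, d, xr)]
    return y, rotations


def bsgs_matvec(M, x, b):
    """Baby-step giant-step variant with n = g * b.

    Only rotations of x and of partial sums are counted, since those
    would act on ciphertexts; rotating the plaintext diagonals is
    assumed to be precomputed offline.
    """
    n = len(x)
    assert n % b == 0, "sketch assumes b divides n"
    g = n // b
    rotations = 0
    baby = [x]
    for i in range(1, b):       # baby steps: b - 1 rotations
        baby.append(rot(x, i))
        rotations += 1
    y = [0] * n
    for j in range(g):          # giant steps: g - 1 rotations
        inner = [0] * n
        for i in range(b):
            d = rot(diagonal(M, j * b + i), -j * b)
            inner = [acc + p * c for acc, p, c in zip(inner, d, baby[i])]
        if j:
            inner = rot(inner, j * b)
            rotations += 1
        y = [acc + c for acc, c in zip(y, inner)]
    return y, rotations
```

In this sketch the plain diagonal method uses n - 1 rotations, while the baby-step giant-step form uses (b - 1) + (n/b - 1), which is smallest when b is close to the square root of n.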

CR · Jan 21

Multi-Input Ciphertext Multiplication for Homomorphic Encryption

Sajjad Akherati, Xinmiao Zhang

Homomorphic encryption (HE) enables arithmetic operations to be performed directly on encrypted data. It is essential for privacy-preserving applications such as machine learning, medical diagnosis, and financial data analysis. In popular HE schemes, ciphertext multiplication is only defined for two inputs. However, the multiplication of multiple inputs is needed in many HE applications. In our previous work, a three-input ciphertext multiplication method for the CKKS HE scheme was developed. This paper first reformulates the three-input ciphertext multiplication to enable the combination of computations in order to further reduce the complexity. The second contribution is extending the multiplication to multiple inputs without compromising the noise overhead. Additional evaluation keys are introduced to achieve relinearization of polynomial multiplication results. To minimize the complexity of the large number of rescaling units in the multiplier, a theoretical analysis is developed to relocate the rescaling, and a multi-level rescaling approach is proposed to implement combined rescaling with complexity similar to that of a single rescaling unit. Guidelines and examples are provided on the input partition to enable the combination of more rescaling. Additionally, efficient hardware architectures are designed to implement our proposed multipliers. The improved three-input ciphertext multiplier reduces the logic area and latency by 15% and 50%, respectively, compared to the best prior design. For multipliers with more inputs, ranging from 4 to 12, the architectural analysis reveals 32% savings in area and 45% shorter latency, on average, compared to prior work.

43.6 · CR · Mar 20

HQC Post-Quantum Cryptography Decryption with Generalized Minimum-Distance Reed-Solomon Decoder

Jiaxuan Cai, Xinmiao Zhang

Hamming Quasi-Cyclic (HQC) was chosen for the latest post-quantum cryptography standardization. A concatenated Reed-Muller (RM) and Reed-Solomon (RS) code is decoded during the HQC decryption. Soft-decision RS decoders achieve better error-correcting performance than hard-decision decoders and accordingly shorten the required codeword and key lengths. However, the only soft-decision decoder for HQC in prior works is an erasure-only decoder, which has limited coding gain. This paper analyzes other hardware-friendly soft-decision RS decoders and finds that the generalized minimum-distance (GMD) decoder can better utilize the soft information available in HQC. Extending the Agrawal-Vardy bound to the HQC scenario shows that the RS codeword length for HQC-128 can be reduced from 46 to 36. This paper also proposes efficient GMD decoder hardware architectures optimized for the short and low-rate RS codes used in HQC. The HQC-128 decryption utilizing the proposed GMD decoder achieves 20% and 15% reductions in latency and area, respectively, compared to the decryption with hard-decision decoders.

CR · Oct 23, 2021

High-Speed VLSI Architectures for Modular Polynomial Multiplication via Fast Filtering and Applications to Lattice-Based Cryptography

Weihang Tan, Antian Wang, Yingjie Lao et al.

This paper presents a low-latency hardware accelerator for modular polynomial multiplication for lattice-based post-quantum cryptography and homomorphic encryption applications. The proposed modular polynomial multiplier exploits the fast finite impulse response (FIR) filter architecture to reduce the computational complexity of the schoolbook modular polynomial multiplication. We also extend this structure to fast $M$-parallel architectures while achieving low latency, high speed, and full hardware utilization. We comprehensively evaluate the performance of the proposed architectures under various polynomial settings as well as in the Saber scheme for post-quantum cryptography as a case study. The experimental results show that our proposed modular polynomial multiplier reduces both the computation time and the area-time product compared to the state-of-the-art designs.
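As a software illustration of the idea, the sketch below compares schoolbook multiplication modulo x^n + 1 with a two-way polyphase split in the spirit of the two-parallel fast FIR decomposition, which needs three half-size products instead of four. This is an assumption-laden sketch, not the paper's hardware architecture: the function names are hypothetical, and the ring x^n + 1 with modulus q = 2^13 is chosen here only because it matches the commonly cited Saber parameters.

```python
def schoolbook_negacyclic(a, b, q):
    """Schoolbook product of a and b modulo (x^n + 1, q)."""
    n = len(a)
    assert len(b) == n
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                # x^n = -1, so wrapped terms are subtracted
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c


def two_parallel_negacyclic(a, b, q):
    """Two-way polyphase product using three half-size multiplications.

    Write a(x) = a0(y) + x*a1(y) with y = x^2 (likewise for b). Then
      c0(y) = a0*b0 + y*a1*b1
      c1(y) = (a0 + a1)*(b0 + b1) - a0*b0 - a1*b1
    with all half-size products taken modulo y^(n/2) + 1.
    """
    n = len(a)
    assert n % 2 == 0 and len(b) == n
    a0, a1 = a[0::2], a[1::2]
    b0, b1 = b[0::2], b[1::2]
    p00 = schoolbook_negacyclic(a0, b0, q)
    p11 = schoolbook_negacyclic(a1, b1, q)
    a_sum = [(u + v) % q for u, v in zip(a0, a1)]
    b_sum = [(u + v) % q for u, v in zip(b0, b1)]
    p_sum = schoolbook_negacyclic(a_sum, b_sum, q)
    # Multiplying by y is a shift whose wrapped coefficient is negated.
    y_p11 = [(-p11[-1]) % q] + p11[:-1]
    c0 = [(u + v) % q for u, v in zip(p00, y_p11)]
    c1 = [(w - u - v) % q for w, u, v in zip(p_sum, p00, p11)]
    c = [0] * n
    c[0::2] = c0                # even-indexed coefficients
    c[1::2] = c1                # odd-indexed coefficients
    return c
```

Applying the same split recursively, or splitting into more phases, trades extra additions for fewer multiplications, which is the kind of trade-off the abstract's $M$-parallel extension refers to.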

AR · Apr 8, 2021

Algorithmic Obfuscation for LDPC Decoders

Jingbo Zhou, Xinmiao Zhang

To protect intellectual property against untrusted foundries, many logic-locking schemes have been developed. The main idea of logic locking is to insert a key-controlled block into a circuit so that the circuit functions incorrectly without the right keys. However, when the algorithm implemented by the circuit is naturally fault-tolerant or self-correcting, existing logic-locking schemes do not affect the system performance much even if wrong keys are used. One example is the low-density parity-check (LDPC) error-correcting decoder, which has broad applications in digital communications and storage. This paper proposes two algorithmic-level obfuscation methods for LDPC decoders. By modifying the decoding process and locking the stopping criterion, our new designs substantially degrade the decoder throughput and/or error-correcting performance when the wrong key is used. In addition, our designs are resistant to the SAT, AppSAT, and removal attacks. For an example LDPC decoder, our proposed methods reduce the throughput to less than 1/3 and/or increase the decoder error rate by at least two orders of magnitude with only 0.33% area overhead.

LG · Nov 18, 2019

RWNE: A Scalable Random-Walk-Based Network Embedding Framework with Personalized Higher-Order Proximity Preserved

Jianxin Li, Cheng Ji, Hao Peng et al.

Network embedding that preserves higher-order proximity has attracted increasing attention. In particular, owing to its superior scalability, random-walk-based network embedding has been well developed, as it can efficiently explore higher-order neighborhoods via multi-hop random walks. However, despite the success of current random-walk-based methods, most of them are not expressive enough to preserve personalized higher-order proximity, and they lack a straightforward objective that theoretically articulates what network proximity is preserved and how. In this paper, to address these issues, we present a general, scalable random-walk-based network embedding framework in which the random walk is explicitly incorporated into a sound, theoretically designed objective that preserves arbitrary higher-order proximity. Further, we introduce the random walk with restart process into the framework to naturally and effectively achieve personalized-weighted preservation of proximities of different orders. We conduct extensive experiments on several real-world networks and demonstrate that our proposed method consistently and substantially outperforms the state-of-the-art network embedding methods.
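For readers unfamiliar with random walk with restart (RWR), the sketch below computes RWR scores for one seed node by fixed-point iteration on a small undirected graph. It illustrates only the generic RWR process named in the abstract, not the RWNE objective or its training procedure. The function name `rwr_scores`, the restart probability 0.15, and the iteration count are assumptions made for this example.

```python
def rwr_scores(adj, seed, restart=0.15, iters=200):
    """Random walk with restart scores relative to `seed`.

    adj is a dense 0/1 adjacency matrix in which every node has at
    least one neighbor. Iterates r <- (1 - restart) * r P + restart * e,
    where P is the row-normalized transition matrix and e is the
    indicator vector of the seed node.
    """
    n = len(adj)
    P = []
    for row in adj:
        deg = sum(row)
        assert deg > 0, "sketch assumes no isolated nodes"
        P.append([v / deg for v in row])
    r = [0.0] * n
    r[seed] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                nxt[j] += (1.0 - restart) * r[i] * P[i][j]
        nxt[seed] += restart    # jump back to the seed
        r = nxt
    return r
```

Unrolling the iteration shows that a k-step walk from the seed contributes with weight restart * (1 - restart)^k, so the restart probability controls how proximities of different orders are weighted, and each seed node gets its own personalized score vector.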

CR · Oct 26, 2019

Generalized SAT-Attack-Resistant Logic Locking

Jingbo Zhou, Xinmiao Zhang

Logic locking is used to protect integrated circuits (ICs) from piracy and counterfeiting. An encrypted IC implements the correct function only when the right key is input. Many existing logic-locking methods are subject to the powerful satisfiability (SAT)-based attack. Recently, an Anti-SAT scheme has been developed. By adopting two complementary logic blocks that consist of AND/NAND trees, it makes the number of iterations needed by the SAT attack exponential in the number of input bits. Nevertheless, the Anti-SAT scheme is vulnerable to the later AppSAT and removal attacks. This paper proposes a generalized (G-)Anti-SAT scheme. Unlike the Anti-SAT scheme, our G-Anti-SAT scheme can adopt a variety of complementary or non-complementary functions for the two blocks. The Anti-SAT scheme is just a special case of our proposed design. Our design can achieve higher output corruptibility, which is also tunable, so that better resistance to the AppSAT and removal attacks is achieved. Meanwhile, unlike existing AppSAT-resilient designs, our design does not sacrifice the resistance to the SAT attack.
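The low output corruptibility that motivates the generalized scheme can be seen in a small truth-table model of the baseline Anti-SAT block. The sketch assumes the commonly described structure: an AND tree over the input XORed with one key half, ANDed with a NAND tree over the input XORed with the other key half, producing a signal that flips a circuit output when it is 1. The function names are hypothetical, and this models only the baseline block, not the G-Anti-SAT design.

```python
from itertools import product


def anti_sat_flip(x, k1, k2):
    """Flip signal of a baseline Anti-SAT block (assumed structure).

    g     = AND  over (x XOR k1)
    g_bar = NAND over (x XOR k2)
    The block outputs g AND g_bar; a 1 corrupts the locked circuit.
    """
    g = all(xi ^ ki for xi, ki in zip(x, k1))
    g_bar = not all(xi ^ ki for xi, ki in zip(x, k2))
    return int(g and g_bar)


def corrupted_inputs(k1, k2):
    """Number of input patterns for which the flip signal is 1."""
    n = len(k1)
    return sum(anti_sat_flip(x, k1, k2) for x in product((0, 1), repeat=n))
```

Under these assumptions, any key with equal halves never raises the flip signal, while a wrong key corrupts exactly one input pattern out of 2^n. Each SAT-attack query therefore rules out few keys, but the locked circuit is almost always correct even with a wrong key, which is the weakness that approximate attacks exploit.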