LGJun 10, 2025Code
SeerAttention-R: Sparse Attention Adaptation for Long ReasoningYizhao Gao, Shuming Guo, Shijie Cao et al. · tsinghua
We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained model without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with 4K token budget in AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
CVMay 2, 2025
Deterministic-to-Stochastic Diverse Latent Feature Mapping for Human Motion SynthesisYu Hua, Weiming Liu, Gui Xu et al.
Human motion synthesis aims to generate plausible human motion sequences, which has raised widespread attention in computer animation. Recent score-based generative models (SGMs) have demonstrated impressive results on this task. However, their training process involves complex curvature trajectories, leading to unstable training process. In this paper, we propose a Deterministic-to-Stochastic Diverse Latent Feature Mapping (DSDFM) method for human motion synthesis. DSDFM consists of two stages. The first human motion reconstruction stage aims to learn the latent space distribution of human motions. The second diverse motion generation stage aims to build connections between the Gaussian distribution and the latent space distribution of human motions, thereby enhancing the diversity and accuracy of the generated human motions. This stage is achieved by the designed deterministic feature mapping procedure with DerODE and stochastic diverse output generation procedure with DivSDE.DSDFM is easy to train compared to previous SGMs-based methods and can enhance diversity without introducing additional training parameters.Through qualitative and quantitative experiments, DSDFM achieves state-of-the-art results surpassing the latest methods, validating its superiority in human motion synthesis.
46.2LGApr 7
Optimal-Transport-Guided Functional Flow Matching for Turbulent Field Generation in Hilbert SpaceLi Kunpeng, Wan Chenguang, Qu Zhisong et al.
High-fidelity modeling of turbulent flows requires capturing complex spatiotemporal dynamics and multi-scale intermittency, posing a fundamental challenge for traditional knowledge-based systems. While deep generative models, such as diffusion models and Flow Matching, have shown promising performance, they are fundamentally constrained by their discrete, pixel-based nature. This limitation restricts their applicability in turbulence computing, where data inherently exists in a functional form. To address this gap, we propose Functional Optimal Transport Conditional Flow Matching (FOT-CFM), a generative framework defined directly in infinite-dimensional function space. Unlike conventional approaches defined on fixed grids, FOT-CFM treats physical fields as elements of an infinite-dimensional Hilbert space, and learns resolution-invariant generative dynamics directly at the level of probability measures. By integrating Optimal Transport (OT) theory, we construct deterministic, straight-line probability paths between noise and data measures in Hilbert space. This formulation enables simulation-free training and significantly accelerates the sampling process. We rigorously evaluate the proposed system on a diverse suite of chaotic dynamical systems, including the Navier-Stokes equations, Kolmogorov Flow, and Hasegawa-Wakatani equations, all of which exhibit rich multi-scale turbulent structures. Experimental results demonstrate that FOT-CFM achieves superior fidelity in reproducing high-order turbulent statistics and energy spectra compared to state-of-the-art baselines.
CRMar 5, 2021
Update the Root of Integrity Tree in Secure Non-Volatile Memory Systems with Low OverheadJianming Huang, Yu Hua
Data integrity is important for non-volatile memory (NVM) systems that maintain data even without power. The data integrity in NVM is possibly compromised by integrity attacks, which can be defended against by integrity verification via integrity trees. After NVM system failures and reboots, the integrity tree root is responsible for providing a trusted execution environment. However, the root often becomes a performance bottleneck, since updating the root requires high latency on the write critical path to propagate the modifications from leaf nodes to the root. The root and leaf nodes have to ensure the crash consistency between each other to avoid any update failures that potentially result in misreporting the attacks after system reboots. In this paper, we propose an efficient and low-latency scheme, called SCUE, to directly update the root on the SGX integrity tree (SIT) by overlooking the updates upon the intermediate tree nodes. The idea behind SCUE explores and exploits the observation that only the persistent leaf nodes and root are useful to ensure the integrity after system failures and reboots, due to the loss of the cached intermediate tree nodes. To achieve the crash consistency between root and leaf nodes, we accurately predict the updates upon the root and pre-update the root before the leaf nodes are modified. Moreover, the SIT root is difficult to be reconstructed from the leaf nodes since updating one tree node needs its parent node as input. We use a counter-summing approach to reconstructing the SIT from leaf nodes. Our evaluation results show that compared with the state-of-the-art integrity tree update schemes, our SCUE scheme delivers high performance while ensuring the system integrity.
CRDec 10, 2019
A Write-Friendly and Fast-Recovery Scheme for Security Metadata in NVMJianming Huang, Yu Hua
Non-Volatile Memories (NVMs) have attracted the attentions of academia and industry, which is expected to become the next-generation memory. However, due to the nonvolatile property, NVMs become vulnerable to attacks and require security mechanisms, e.g., counter mode encryption and integrity tree, which introduce the security metadata. NVMs promise to recover these security metadata after a system crash, including the counter and integrity tree. However, unlike merkle tree reconstructed from user data, recovering SGX integrity tree (SIT) has to address the challenges from unique top-down hierarchical dependency. Moreover, writing overhead and recovery time are important metrics for evaluating persistent memory system due to the high costs of NVM writes and IT downtime. How to recover the security metadata, i.e., counter blocks and integrity tree nodes, with low write overhead and short recovery time, becomes much important. To provide a fast recovery scheme with low write overhead, we propose STAR, a cost-efficient scheme for recovering counter blocks and SGX integrity tree nodes after crashes. For fast recovery and verification, STAR synergizes the MAC and correct data, uses bitmap lines in ADR to indicate the location of stale node and constructs a cached merkle tree to verify the correctness of the recovery process. Moreover, STAR uses a multi-layer index to speed up the recovery process. STAR also allows different configurations to meet adaptive requirements for write overhead and recovery time. Our evaluation results show that the proposed STAR reduces the number of memory writes by up to 87\% compared with state-of-the-art work, Anubis, which needs extra 1x memory writes. For a 4MB security metadata cache, STAR needs 0.039s/0.023s/0.004s in three different configurations to recover the metadata cache while Anubis needs 0.020s.
DBMay 8, 2019
A Scalable Learned Index Scheme in Storage SystemsPengfei Li, Yu Hua, Pengfei Zuo et al.
Index structures are important for efficient data access, which have been widely used to improve the performance in many in-memory systems. Due to high in-memory overheads, traditional index structures become difficult to process the explosive growth of data, let alone providing low latency and high throughput performance with limited system resources. The promising learned indexes leverage deep-learning models to complement existing index structures and obtain significant memory savings. However, the learned indexes fail to become scalable due to the heavy inter-model dependency and expensive retraining. To address these problems, we propose a scalable learned index scheme to construct different linear regression models according to the data distribution. Moreover, the used models are independent so as to reduce the complexity of retraining and become easy to partition and store the data into different pages, blocks or distributed systems. Our experimental results show that compared with state-of-the-art schemes, AIDEL improves the insertion performance by about 2$\times$ and provides comparable lookup performance, while efficiently supporting scalability.
DCJan 3, 2019
A Secure and Persistent Memory System for Non-volatile MemoryPengfei Zuo, Yu Hua, Yuan Xie
In the non-volatile memory, ensuring the security and correctness of persistent data is fundamental. However, the security and persistence issues are usually studied independently in existing work. To achieve both data security and persistence, simply combining existing persistence schemes with memory encryption is inefficient due to crash inconsistency and significant performance degradation. To bridge the gap between security and persistence, this paper proposes SecPM, a Secure and Persistent Memory system, which consists of a counter cache write-through (CWT) scheme and a locality-aware counter write reduction (CWR) scheme. Specifically, SecPM leverages the CWT scheme to guarantee the crash consistency via ensuring both the data and its counter are durable before the data flush completes, and leverages the CWR scheme to improve the system performance via exploiting the spatial locality of counter storage, log and data writes. We have implemented SecPM in gem5 with NVMain and evaluated it using five widely-used workloads. Extensive experimental results demonstrate that SecPM reduces up to half of write requests and speeds up the transaction execution by 1.3-2.0 times via using the CWR scheme, and achieves the performance close to an un-encrypted persistent memory system for large transactions.
CRMar 15, 2017
Bandwidth-efficient Storage Services for Mitigating Side Channel AttackPengfei Zuo, Yu Hua, Cong Wang et al.
Data deduplication is able to effectively identify and eliminate redundant data and only maintain a single copy of files and chunks. Hence, it is widely used in cloud storage systems to save storage space and network bandwidth. However, the occurrence of deduplication can be easily identified by monitoring and analyzing network traffic, which leads to the risk of user privacy leakage. The attacker can carry out a very dangerous side channel attack, i.e., learn-the-remaining-information (LRI) attack, to reveal users' privacy information by exploiting the side channel of network traffic in deduplication. Existing work addresses the LRI attack at the cost of the high bandwidth efficiency of deduplication. In order to address this problem, we propose a simple yet effective scheme, called randomized redundant chunk scheme (RRCS), to significantly mitigate the risk of the LRI attack while maintaining the high bandwidth efficiency of deduplication. The basic idea behind RRCS is to add randomized redundant chunks to mix up the real deduplication states of files used for the LRI attack, which effectively obfuscates the view of the attacker, who attempts to exploit the side channel of network traffic for the LRI attack. Our security analysis shows that RRCS could significantly mitigate the risk of the LRI attack. We implement the RRCS prototype and evaluate it by using three large-scale real-world datasets. Experimental results demonstrate the efficiency and efficacy of RRCS.