IRMay 15
Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit QuantizationFujun He, Chuyue Ye, Huaxiang Cai et al.
Vector similarity search is a critical component of modern AI systems, but traditional CPU-based implementations face fundamental scalability bottlenecks for billion-scale corpora due to prohibitive computational overhead and memory bandwidth limitations. While Neural Processing Units (NPUs) offer orders-of-magnitude higher compute density, existing CPU/GPU-optimized 1-bit RaBitQ quantization implementations cannot be directly ported to NPU architectures due to fundamental hardware mismatches, and homogeneous design paradigms struggle to simultaneously balance accuracy, memory footprint, and performance. This paper presents Ascend-RaBitQ, the first heterogeneous NPU-CPU optimized IVF-RaBitQ system for billion-scale vector search, built on the core insight that decoupling coarse ranking (NPU) from fine ranking (CPU) allows each stage to leverage its optimal hardware, breaking the long-standing accuracy-memory-performance trade-off. We propose a three-stage heterogeneous pipeline comprising AI Core-accelerated coarse ranking on 1-bit quantized vectors, on-device AI CPU Top-k processing, and host CPU fine re-ranking on full-precision vectors. We introduce four NPU architecture-native optimizations: fused AIC-AIV operators for parallel distance computation, computation flow restructuring to exploit rotation orthogonality, fine-grained index block-level load balancing that breaks query boundaries, and intra-NPU pipeline parallelism between AI Core and AI CPU to mask Top-k latency. Evaluation on standard datasets shows that Ascend-RaBitQ achieves 3.0* to 62.8* faster index construction than the CPU baseline, up to 4.6* throughput improvement over the fastest CPU IVF-RaBitQ implementation, and over 100* over the mathematically equivalent CPU baseline, while demonstrating encouraging scalability on distributed multi-NPU systems.
CLMay 7, 2025
Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUsYehui Tang, Yichun Yin, Yaoyuan Wang et al.
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize the communication between NPU devices to reduce the synchronization overhead. We also optimize the memory efficiency within the devices to further reduce the parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system is capable of harnessing all the training stages of the state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.
LGApr 6, 2021
Progressive extension of reinforcement learning action dimension for asymmetric assembly tasksYuhang Gai, Jiuming Guo, Dan Wu et al.
Reinforcement learning (RL) is always the preferred embodiment to construct the control strategy of complex tasks, like asymmetric assembly tasks. However, the convergence speed of reinforcement learning severely restricts its practical application. In this paper, the convergence is first accelerated by combining RL and compliance control. Then a completely innovative progressive extension of action dimension (PEAD) mechanism is proposed to optimize the convergence of RL algorithms. The PEAD method is verified in DDPG and PPO. The results demonstrate the PEAD method will enhance the data-efficiency and time-efficiency of RL algorithms as well as increase the stable reward, which provides more potential for the application of RL.