CLFeb 28, 2024Code
Evaluating Quantized Large Language ModelsShiyao Li, Xuefei Ning, Luning Wang et al.
Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the requirements of both high efficiency and performance across diverse scenarios, a comprehensive evaluation of quantized LLMs is essential to guide the selection of quantization methods. This paper presents a thorough evaluation of these factors by evaluating the effect of PTQ on Weight, Activation, and KV Cache on 11 model families, including OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba, with parameters ranging from 125M to 180B. The evaluation encompasses five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. Moreover, we also evaluate the state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on the extensive experiments, we systematically summarize the effect of quantization, provide recommendations to apply quantization techniques, and point out future directions. The code can be found in https://github.com/thu-nics/qllm-eval.
AINov 21, 2020
BARS: Joint Search of Cell Topology and Layout for Accurate and Efficient Binary ARchitecturesTianchen Zhao, Xuefei Ning, Xiangsheng Shi et al.
Binary Neural Networks (BNNs) have received significant attention due to their promising efficiency. Currently, most BNN studies directly adopt widely-used CNN architectures, which can be suboptimal for BNNs. This paper proposes a novel Binary ARchitecture Search (BARS) flow to discover superior binary architecture in a large design space. Specifically, we analyze the information bottlenecks that are related to both the topology and layout architecture design choices. And we propose to automatically search for the optimal information flow. To achieve that, we design a two-level (Macro & Micro) search space tailored for BNNs and apply a differentiable neural architecture search (NAS) to explore this search space efficiently. The macro-level search space includes width and depth decisions, which is required for better balancing the model performance and complexity. We also design the micro-level search space to strengthen the information flow for BNN. %A notable challenge of BNN architecture search lies in that binary operations exacerbate the "collapse" problem of differentiable NAS, for which we incorporate various search and derive strategies to stabilize the search process. On CIFAR-10, BARS achieves 1.5% higher accuracy with 2/3 binary operations and 1/10 floating-point operations comparing with existing BNN NAS studies. On ImageNet, with similar resource consumption, BARS-discovered architecture achieves a 6% accuracy gain than hand-crafted binary ResNet-18 architectures and outperforms other binary architectures while fully binarizing the architecture backbone.