LG · AR · Apr 25

Efficient VQ-QAT and Mixed Vector/Linear quantized Neural Networks

arXiv:2604.23172 · 1.3 · h-index: 1
AI Analysis

This work provides insights into design trade-offs for VQ-based model compression, but offers only incremental improvements over existing methods.

The authors developed three techniques for vector quantization-based model weight compression, including cosine similarity-based assignment, top-1 sampling with straight-through estimator, and differentiable NAS for layer-wise quantization. The method does not consistently outperform existing approaches across all quantization levels.

In this work, we developed and tested three techniques for vector quantization (VQ)-based compression of model weights. To mitigate codebook collapse and enable end-to-end training, we adopted cosine similarity-based assignment. Building on the attention-based formulation of Differentiable K-Means (DKM), we then combined cosine-similarity assignment with top-1 sampling and a straight-through estimator, which eliminates the need for weighted-average reconstruction. Finally, we investigated differentiable neural architecture search (NAS) as a way to adaptively select layer-wise quantization configurations. Although our method does not consistently outperform existing approaches across all quantization levels, it provides useful insights into the design trade-offs and behaviors of VQ-based model compression methods.
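
To make the second technique concrete, here is a minimal sketch of cosine-similarity assignment with top-1 selection and a straight-through estimator. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `assign_top1`, `quantize_forward`, `ste_backward`), the toy codebook, and the 2-dimensional sub-vectors are all hypothetical, and the abstract does not say how weight magnitude is handled, so the sketch simply copies the selected codeword.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def assign_top1(subvectors, codebook):
    """For each weight sub-vector, pick the index of the most cosine-similar codeword."""
    return [
        max(range(len(codebook)), key=lambda k: cosine(w, codebook[k]))
        for w in subvectors
    ]

def quantize_forward(subvectors, codebook):
    """Forward pass: replace each sub-vector by its single selected codeword.

    No weighted average over codewords is formed (unlike a soft, attention-style
    reconstruction); only the top-1 codeword is used.
    """
    idx = assign_top1(subvectors, codebook)
    return [list(codebook[k]) for k in idx], idx

def ste_backward(grad_quantized):
    """Straight-through estimator: treat the hard assignment as the identity in the
    backward pass, so the gradient reaches the latent weights unchanged."""
    return [list(g) for g in grad_quantized]

# Toy data (assumed for illustration only).
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
weights = [[0.9, 0.1], [0.2, 0.7], [-0.5, -0.4]]

quantized, indices = quantize_forward(weights, codebook)
print(indices)    # [0, 1, 2]
print(quantized)  # [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
```

In an autograd framework the same effect is usually written as `w + stop_gradient(w_q - w)`, which evaluates to `w_q` in the forward pass and has the gradient of `w` in the backward pass.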
