CV · Sep 29, 2023
Data-Free Dynamic Compression of CNNs for Tractable Efficiency
Lukas Meiner, Jens Mehnert, Alexandru Paul Condurache
To reduce the computational cost of convolutional neural networks (CNNs) on resource-constrained devices, structured pruning approaches have shown promise in lowering floating-point operations (FLOPs) without substantial drops in accuracy. However, most methods require fine-tuning or specific training procedures to achieve a reasonable trade-off between retained accuracy and reduction in FLOPs, adding computational overhead and requiring training data to be available. To this end, we propose HASTE (Hashing for Tractable Efficiency), a data-free, plug-and-play convolution module that instantly reduces a network's test-time inference cost without training or fine-tuning. Our approach utilizes locality-sensitive hashing (LSH) to detect redundancies in the channel dimension of latent feature maps, compressing similar channels to reduce input and filter depth simultaneously, resulting in cheaper convolutions. We demonstrate our approach on the popular vision benchmarks CIFAR-10 and ImageNet, where we achieve a 46.72% reduction in FLOPs with only a 1.25% loss in accuracy by swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
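The channel-compression idea in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' HASTE module: the function names `lsh_codes` and `merge_channels`, the random-hyperplane hash, hashing whole flattened channels, and the choice to average inputs while summing filter slices are all assumptions made for illustration.

```python
import numpy as np

def lsh_codes(flat_channels, num_planes, rng):
    """Random-hyperplane LSH: one sign bit per hyperplane (assumed hash family)."""
    planes = rng.standard_normal((flat_channels.shape[1], num_planes))
    bits = (flat_channels @ planes > 0).astype(np.uint8)
    return [row.tobytes() for row in bits]

def merge_channels(x, w, num_planes=8, seed=0):
    """Merge input channels that share a hash code (hypothetical helper).

    x: (C, H, W) feature map, w: (C_out, C, kH, kW) filters.
    Channels in one bucket are averaged; the matching filter slices are
    summed, so the convolution is unchanged when bucket members are identical.
    """
    rng = np.random.default_rng(seed)
    codes = lsh_codes(x.reshape(x.shape[0], -1), num_planes, rng)
    buckets = {}
    for c, code in enumerate(codes):
        buckets.setdefault(code, []).append(c)
    groups = list(buckets.values())
    x_red = np.stack([x[g].mean(axis=0) for g in groups])
    w_red = np.stack([w[:, g].sum(axis=1) for g in groups], axis=1)
    return x_red, w_red, groups
```

Because convolution is linear in the input channels, the reduced input and reduced filters give the same output whenever the merged channels are exact duplicates, and an approximation otherwise. The number of hyperplanes controls how aggressively channels collide.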
CV · Nov 26, 2025
HTTM: Head-wise Temporal Token Merging for Faster VGGT
Weitian Wang, Lukas Meiner, Rai Shubham et al.
The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal token merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens at multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7x acceleration with negligible performance drops in GPU-based inference.
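The contrast between uniform and head-wise merging described above can be sketched in NumPy. This is an illustration of the duplicate-token effect only, not the HTTM algorithm: the even/odd source/destination split, cosine-similarity matching, and immediate unmerge by copying are assumptions borrowed from common token-merging practice, and all function names are hypothetical.

```python
import numpy as np

def merge_then_unmerge(feats, r):
    """Average the r most similar (source, destination) token pairs, then copy
    the average back to every member so the token count stays at N.

    feats: (N, d). Even indices act as sources, odd indices as destinations
    (an assumed bipartite split).
    """
    src_idx = np.arange(0, len(feats), 2)
    dst_idx = np.arange(1, len(feats), 2)
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = unit[src_idx] @ unit[dst_idx].T          # cosine similarity
    best = sim.argmax(axis=1)                      # best destination per source
    chosen = np.argsort(-sim.max(axis=1))[:r]      # r most similar sources
    out = feats.copy()
    for d in np.unique(best[chosen]):
        members = [dst_idx[d]] + [src_idx[s] for s in chosen if best[s] == d]
        out[members] = feats[members].mean(axis=0)
    return out

def uniform_merge(x, r):
    """One matching shared by all heads. x: (H, N, d) -> (N, H*d)."""
    H, N, d = x.shape
    concat = x.transpose(1, 0, 2).reshape(N, H * d)
    return merge_then_unmerge(concat, r)

def headwise_merge(x, r):
    """Each head picks its own pairs before head concatenation."""
    per_head = [merge_then_unmerge(x[h], r) for h in range(x.shape[0])]
    return np.concatenate(per_head, axis=1)
```

With a shared matching, every merged token ends up as an exact copy of another token across the full concatenated feature, so r rows become duplicates. With per-head matching, two tokens only coincide after concatenation if every head happened to merge them together, so the number of distinct rows is at least as large.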
CV · May 6, 2025
PROM: Prioritize Reduction of Multiplications Over Lower Bit-Widths for Efficient CNNs
Lukas Meiner, Jens Mehnert, Alexandru Paul Condurache
Convolutional neural networks (CNNs) are crucial for computer vision tasks on resource-constrained devices. Quantization effectively compresses these models, reducing storage size and energy cost. However, in modern depthwise-separable architectures, the computational cost is distributed unevenly across their components, with pointwise operations being the most expensive. By applying a general quantization scheme to this imbalanced cost distribution, existing quantization approaches fail to fully exploit potential efficiency gains. To this end, we introduce PROM, a straightforward approach for quantizing modern depthwise-separable convolutional networks by selectively using two distinct bit-widths. Specifically, pointwise convolutions are quantized to ternary weights, while the remaining modules use 8-bit weights, which is achieved through a simple quantization-aware training procedure. Additionally, by quantizing activations to 8-bit, our method transforms pointwise convolutions with ternary weights into int8 additions, which enjoy broad support across hardware platforms, effectively eliminating the need for expensive multiplications. Applying PROM to MobileNetV2 reduces the model's energy cost by more than an order of magnitude (23.9x) and its storage size by 2.7x compared to the float16 baseline while retaining similar classification performance on ImageNet. Our method advances the Pareto frontier for energy consumption vs. top-1 accuracy for quantized convolutional models on ImageNet. PROM addresses the challenges of quantizing depthwise-separable convolutional networks to both ternary and 8-bit weights, offering a simple way to reduce energy cost and storage size.
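The claim that ternary weights turn a pointwise convolution into additions can be checked with a short NumPy sketch. It is an illustration under assumptions rather than the PROM implementation: the function name `pointwise_conv_ternary` is hypothetical, scaling factors are omitted, and the int32 accumulator is an assumed choice to avoid overflow.

```python
import numpy as np

def pointwise_conv_ternary(x_int8, w_ternary):
    """1x1 convolution with weights in {-1, 0, +1}, using only adds/subtracts.

    x_int8:    (C_in, H, W) int8 activations.
    w_ternary: (C_out, C_in) weights with entries in {-1, 0, +1}.
    Returns (C_out, H, W) int32 accumulators (assumed accumulator width).
    """
    x = x_int8.astype(np.int32)
    out = np.zeros((w_ternary.shape[0],) + x.shape[1:], dtype=np.int32)
    for o in range(w_ternary.shape[0]):
        plus = x[w_ternary[o] == 1].sum(axis=0)    # channels with weight +1
        minus = x[w_ternary[o] == -1].sum(axis=0)  # channels with weight -1
        out[o] = plus - minus                      # weight 0 channels are skipped
    return out
```

Each output channel is the sum of the input channels weighted +1 minus the sum of those weighted -1, so no multiplication is needed. The result matches an ordinary integer matrix product over the channel dimension.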