CV | CL | Jan 14, 2025

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

arXiv:2501.07783v1 | 10 citations | h-index: 46 | Has Code | IEEE Trans Pattern Anal Mach Intell
Originality: Highly original
AI Analysis

This paper addresses efficiency bottlenecks in computer vision and multimodal AI systems, offering a novel architecture for scalable multi-scale feature extraction.

The paper tackles the high computational cost of multi-scale image processing in visual perception by proposing Parameter-Inverted Image Pyramid Networks (PIIP), which uses smaller networks for higher-resolution images and achieves performance gains, such as improving a large vision model by 1%-2% on detection and segmentation with 40%-60% of the original computation.

Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process multiple resolutions of images, leading to significant computational cost. To address this challenge, we propose a novel network architecture, called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where images of higher resolutions are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and a representative multimodal large language model called LLaVA, and conduct extensive experiments on various tasks such as object detection, segmentation, image classification and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches with lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP can improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data. Our code is released at https://github.com/OpenGVLab/PIIP.
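
The parameter-inverted pairing described in the abstract can be illustrated with a rough cost estimate. The sketch below is not the authors' code or their FLOPs accounting: it assumes a ViT-style cost of roughly 2 x parameters x tokens per forward pass (ignoring the quadratic attention term and the cost of cross-branch interaction), approximate parameter counts for ViT-S/B/L, a patch size of 16, and three hypothetical input resolutions. The names `Branch`, `approx_cost`, and `parameter_inverted` are introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Branch:
    """One pyramid branch: a model of a given size fed one image resolution."""
    params_m: float  # parameter count in millions (approximate, illustrative)
    resolution: int  # side length of the square input image, in pixels


def num_tokens(resolution: int, patch: int = 16) -> int:
    """Number of patch tokens a ViT sees for a square image (assumed patch size)."""
    return (resolution // patch) ** 2


def approx_cost(branch: Branch, patch: int = 16) -> float:
    """Assumed cost model: about 2 * params * tokens operations per forward pass."""
    return 2.0 * branch.params_m * 1e6 * num_tokens(branch.resolution, patch)


def pyramid_cost(branches) -> float:
    """Total cost of running every branch once."""
    return sum(approx_cost(b) for b in branches)


def parameter_inverted(resolutions, model_sizes_m):
    """Pair the highest resolution with the smallest model, and so on down."""
    res_desc = sorted(resolutions, reverse=True)
    sizes_asc = sorted(model_sizes_m)
    return [Branch(p, r) for p, r in zip(sizes_asc, res_desc)]


# Hypothetical setup: three resolutions, approximate ViT-S / ViT-B / ViT-L sizes.
resolutions = [448, 896, 1344]
model_sizes_m = [22.0, 86.0, 304.0]

# Conventional image pyramid: the same large model processes every resolution.
conventional = [Branch(max(model_sizes_m), r) for r in resolutions]
# Parameter-inverted pyramid: smaller models take the larger images.
inverted = parameter_inverted(resolutions, model_sizes_m)

ratio = pyramid_cost(inverted) / pyramid_cost(conventional)
print(f"inverted / conventional cost under this model: {ratio:.2f}")
```

Under this simplified model the inverted pairing is cheaper because the token count grows quadratically with resolution, so the largest images dominate the cost and benefit most from a smaller branch. The actual savings reported in the paper depend on the real architectures and the interaction modules, which this sketch does not model.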

Code Implementations: 1 repo
