CV | CL | LG | Oct 23, 2023

LXMERT Model Compression for Visual Question Answering

arXiv:2310.15325v1 | 5 citations | h-index: 8
Originality: Synthesis-oriented
AI Analysis

This work addresses model efficiency for vision-language tasks, but it is incremental as it applies known pruning techniques to a specific model and dataset.

The paper asks whether large pretrained models such as LXMERT can be compressed for visual question answering. It tests for trainable subnetworks and evaluates how far the model can be pruned, reporting a 40%-60% size reduction with a 3% loss in accuracy.

Large-scale pretrained models such as LXMERT are becoming popular for learning cross-modal representations on text-image pairs for vision-language tasks. According to the lottery ticket hypothesis, NLP and computer vision models contain smaller subnetworks capable of being trained in isolation to full performance. In this paper, we combine these observations to evaluate whether such trainable subnetworks exist in LXMERT when fine-tuned on the VQA task. In addition, we perform a model size cost-benefit analysis by investigating how much pruning can be done without significant loss in accuracy. Our experimental results demonstrate that LXMERT can be effectively pruned by 40%-60% in size with a 3% loss in accuracy.
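The lottery-ticket idea mentioned in the abstract can be illustrated with a minimal sketch. This is generic magnitude pruning with weight rewinding on a toy weight list, not the authors' procedure, which this summary does not describe. The function names `magnitude_mask` and `lottery_ticket_reset` and all weight values are hypothetical and chosen only for illustration.

```python
from typing import List


def magnitude_mask(weights: List[float], sparsity: float) -> List[int]:
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending absolute value; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def lottery_ticket_reset(initial: List[float], mask: List[int]) -> List[float]:
    """Rewind surviving weights to their initial values; pruned weights become zero."""
    return [w0 * m for w0, m in zip(initial, mask)]


# Hypothetical toy values, not taken from LXMERT.
initial_weights = [0.5, -0.1, 0.3, 0.05, -0.8, 0.2, -0.4, 0.01, 0.6, -0.25]
trained_weights = [0.9, -0.02, 0.4, 0.01, -1.1, 0.05, -0.6, 0.03, 0.7, -0.3]

# Prune 50% of weights by trained magnitude, then rewind the survivors.
mask = magnitude_mask(trained_weights, sparsity=0.5)
ticket = lottery_ticket_reset(initial_weights, mask)
kept_fraction = sum(mask) / len(mask)
```

In a lottery-ticket experiment, the rewound subnetwork (`ticket` here) would then be retrained and its accuracy compared against the full model at each sparsity level. That comparison is the kind of size-versus-accuracy trade-off the abstract reports.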

Code Implementations: 2 repos