CV CL LGOct 6, 2023

Module-wise Adaptive Distillation for Multimodality Foundation Models

Chen Liang, Jiahui Yu, Ming-Hsuan Yang, Matthew Brown, Yin Cui, Tuo Zhao, Boqing Gong, Tianyi Zhou

Georgia Tech

arXiv:2310.04550v113.117 citationsh-index: 14

Originality Incremental advance

AI Analysis

This work addresses the problem of model compression for multimodal AI systems, offering an incremental improvement over existing layerwise distillation techniques.

The paper tackles the challenge of deploying large multimodal foundation models by proposing a module-wise adaptive distillation method, where they use a multi-armed bandit approach to prioritize distilling more important modules, resulting in improved student model performance on multimodal tasks.

Pre-trained multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes. One effective approach to reducing their sizes is layerwise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer. Motivated by our observation that certain architecture components, referred to as modules, contribute more significantly to the student's performance than others, we propose to track the contributions of individual modules by recording the loss decrement after distillation each module and choose the module with a greater contribution to distill more frequently. Such an approach can be naturally formulated as a multi-armed bandit (MAB) problem, where modules and loss decrements are considered as arms and rewards, respectively. We then develop a modified-Thompson sampling algorithm named OPTIMA to address the nonstationarity of module contributions resulting from model updating. Specifically, we leverage the observed contributions in recent history to estimate the changing contribution of each module and select modules based on these estimations to maximize the cumulative contribution. We evaluate the effectiveness of OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model (Yu et al., 2022) as the teacher model.

View on arXiv PDF

Similar