CL, AI, LG | Apr 15, 2022

XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding

arXiv:2204.07316v3638 citations | h-index: 52
Originality: Incremental advance
AI Analysis

The paper aims to improve natural language understanding for AI applications. The advance is incremental, since it builds on existing cross-modal and BERT methods.

The study improves language understanding by distilling visual information from pretrained multimodal transformers into BERT. The resulting model, XDBERT, outperforms pretrained BERT on the GLUE, SWAG, and readability benchmarks.

Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks. This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders. Our framework is inspired by the success of cross-modal encoders in visual-language tasks, but we alter the learning objective to suit the language-heavy nature of NLU. After a small number of extra adaptation steps followed by finetuning, the proposed XDBERT (cross-modal distilled BERT) outperforms pretrained BERT on the general language understanding evaluation (GLUE) benchmark, the situations with adversarial generations (SWAG) benchmark, and readability benchmarks. We analyze the performance of XDBERT on GLUE to show that the improvement is likely visually grounded.
