CL CV · Nov 28, 2023

Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?

Wang Zhu, Ishika Singh, Yuan Huang, Robin Jia, Jesse Thomason

arXiv:2311.17280v41.33 citations · h-index: 9

Originality: Synthesis-oriented

AI Analysis

This work addresses the efficiency of data augmentation for VLN pretraining, showing that instruction quality may matter less than previously assumed. Its contribution is incremental, mainly challenging existing practice.

The study investigates whether noisy (nonsensical or irrelevant) language instructions used during pretraining affect Vision-and-Language Navigation (VLN) models. It finds that such instructions have little impact on downstream performance for HAMT and VLN-BERT on R2R, and that pretraining with them still outperforms pretraining on clean human data alone. An efficient augmentation method, Unigram + Object, is introduced to demonstrate this.

Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy. But does that noise matter? We find that nonsensical or irrelevant language instructions during pretraining can have little effect on downstream performance for both HAMT and VLN-BERT on R2R, and that pretraining with them is still better than using only clean, human data. To underscore these results, we concoct an efficient augmentation method, Unigram + Object, which generates nonsensical instructions that nonetheless improve downstream performance. Our findings suggest that what matters for VLN R2R pretraining is the quantity of visual trajectories, not the quality of instructions.
