LG AI CL CV

Oct 20, 2024

Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training

Rohan Saha, Abrar Fahim, Alona Fyshe, Alex Murphy

arXiv:2410.15509v1 19.315 citations h-index: 18 Has Code

Originality Synthesis-oriented

AI Analysis

This work addresses efficient training for vision-language tasks in data-scarce domains, but it is incremental as it builds on existing curriculum learning and pretraining methods.

The study addresses training vision-language models with limited data by varying three factors: curriculum learning, text-only pretraining, and model type. It finds that curriculum learning improves multimodal evaluations, especially when combined with text-only pretraining, and that it appears to help models with fewer trainable parameters on text-only tasks.

For specialized domains, there is often not a wealth of data with which to train large machine learning models. In such limited data / compute settings, various methods aim to $\textit{do more with less}$, such as finetuning from a pretrained model, modulating difficulty levels as data are presented to a model (curriculum learning), and considering the role of model type / size. Approaches to efficient $\textit{machine}$ learning also take inspiration from $\textit{human}$ learning by considering use cases where machine learning systems have access to approximately the same number of words experienced by a 13-year-old child (100M words). We investigate the role of three primary variables in a limited data regime as part of the multimodal track of the BabyLM challenge. We contrast: (i) curriculum learning, (ii) pretraining (with text-only data), and (iii) model type. We modulate these variables and assess them on two types of tasks: (a) multimodal (text+image) and (b) unimodal (text-only). We find that curriculum learning benefits multimodal evaluations over non-curriculum learning models, particularly when combined with text-only pretraining. On text-only tasks, curriculum learning appears to help models with smaller trainable parameter counts. We suggest possible reasons, based on architectural differences and training designs, as to why one might observe such results.
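
To make the curriculum learning idea in the abstract concrete, here is a minimal sketch of easy-to-hard data ordering. It is an illustration under stated assumptions, not the authors' method: the abstract does not say how difficulty is measured or how the data are staged, so the `curriculum_stages` helper, the word-count difficulty proxy, and the choice of cumulative stages are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_stages(
    examples: Sequence[T],
    difficulty: Callable[[T], float],
    num_stages: int,
) -> List[List[T]]:
    """Order examples from easy to hard and return cumulative training stages.

    Stage k contains the easiest k/num_stages fraction of the data, so each
    later stage adds harder examples to what was seen before (an assumed
    design; other curricula present each difficulty bucket only once).
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # sorted() is stable, so equally difficult examples keep their input order.
    ordered = sorted(examples, key=difficulty)
    stages = []
    for stage in range(1, num_stages + 1):
        cutoff = math.ceil(len(ordered) * stage / num_stages)
        stages.append(ordered[:cutoff])
    return stages


# Hypothetical difficulty proxy: captions with more words count as harder.
captions = [
    "a dog",
    "a small brown dog runs across the wet grass",
    "two cats on a sofa",
    "a bird",
]
stages = curriculum_stages(captions, lambda c: len(c.split()), num_stages=2)
for index, stage in enumerate(stages, start=1):
    print(f"stage {index}: {len(stage)} examples")
```

A non-curriculum baseline, by contrast, would shuffle all examples and train on them in random order. Comparing the two orderings is the kind of contrast the abstract describes.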
