Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data
The paper addresses the under-explored multimodal side of AutoML for practitioners working with flexible combinations of image, text, and tabular data, though the contribution is incremental because it builds on existing AutoML efforts.
This paper examines best practices for automatic machine learning (AutoML) with multimodal data, focusing on classification and regression over image, text, and tabular inputs. It distills effective strategies into a unified pipeline that performs robustly across a benchmark of 22 datasets.
This paper studies the best practices for automatic machine learning (AutoML). While previous AutoML efforts have predominantly focused on unimodal data, the multimodal aspect remains under-explored. Our study delves into classification and regression problems involving flexible combinations of image, text, and tabular data. We curate a benchmark comprising 22 multimodal datasets from diverse real-world applications, encompassing all 4 combinations of the 3 modalities. Across this benchmark, we scrutinize design choices related to multimodal fusion strategies, multimodal data augmentation, converting tabular data into text, cross-modal alignment, and handling missing modalities. Through extensive experimentation and analysis, we distill a collection of effective strategies and consolidate them into a unified pipeline, achieving robust performance on diverse datasets.
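One of the design choices listed in the abstract, converting tabular data into text, can be illustrated with a minimal sketch. The abstract does not describe the paper's actual serialization scheme, so everything here is an assumption made for illustration: the function name `row_to_text`, the "column: value" template, the separator, and the two ways of treating missing cells.

```python
import math


def row_to_text(row, sep=" ; ", skip_missing=True):
    """Serialize one tabular row (a dict of column -> value) into a string.

    Hypothetical helper, not the paper's method: each cell becomes
    "column: value", and cells are joined with `sep`.
    """
    parts = []
    for column, value in row.items():
        # Treat None and float NaN as missing cells.
        is_missing = value is None or (
            isinstance(value, float) and math.isnan(value)
        )
        if is_missing:
            if skip_missing:
                continue  # drop the cell entirely
            value = "unknown"  # or keep it with a placeholder (assumed token)
        parts.append(f"{column}: {value}")
    return sep.join(parts)


example = {"price": 12.5, "category": "toys", "rating": None}
print(row_to_text(example))
print(row_to_text(example, skip_missing=False))
```

The resulting string could then be fed to a text encoder alongside any free-text fields. Whether dropping missing cells or marking them with a placeholder works better is an empirical question of the kind the benchmark is meant to answer.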