LANISTR: Multimodal Learning from Structured and Unstructured Data
This work addresses a practical challenge in AI for domains such as healthcare and retail: learning robustly from mixed data types when a large share of samples are missing one or more modalities. Its contribution is incremental, extending existing multimodal methods to cover structured data.
The paper tackles multimodal learning that combines structured data (tabular and time-series) with unstructured data (language and image), a setting that is common in practice but understudied. It proposes LANISTR, an attention-based framework that improves on state-of-the-art alternatives by 6.6% in AUROC on MIMIC-IV (healthcare) and 14% in accuracy on Amazon Product Review (retail) when fine-tuned with 0.1% and 0.01% of labeled data, respectively.
Multimodal large-scale pretraining has shown impressive performance for unstructured data such as language and image. However, a prevalent real-world scenario involves structured data types, tabular and time-series, along with unstructured data. Such scenarios have been understudied. To bridge this gap, we propose LANISTR, an attention-based framework to learn from LANguage, Image, and STRuctured data. The core of LANISTR's methodology is rooted in \textit{masking-based} training applied across both unimodal and multimodal levels. In particular, we introduce a new similarity-based multimodal masking loss that enables it to learn cross-modal relations from large-scale multimodal data with missing modalities. On two real-world datasets, MIMIC-IV (from healthcare) and Amazon Product Review (from retail), LANISTR demonstrates remarkable improvements, 6.6\% (in AUROC) and 14\% (in accuracy) when fine-tuned with 0.1\% and 0.01\% of labeled data, respectively, compared to the state-of-the-art alternatives. Notably, these improvements are observed even with a very high ratio of samples (35.7\% and 99.8\%, respectively) not containing all modalities, underlining the robustness of LANISTR to the practical challenge of missing modalities. Our code and models will be available at https://github.com/google-research/lanistr
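
The abstract does not spell out the form of the similarity-based multimodal masking loss, so the following is only an illustrative sketch under stated assumptions, not the LANISTR implementation. It assumes that (a) each modality has already been encoded to a fixed-size embedding, (b) fusion is a plain average over the modalities that are present (the paper uses attention), and (c) the loss is the negative cosine similarity between the fused representation of the masked input and that of the unmasked input. The names `fuse`, `cosine_similarity`, and `masked_similarity_loss` are hypothetical.

```python
import numpy as np


def fuse(embeddings, present):
    """Average the embeddings of the modalities that are present.

    Assumption: mean pooling stands in for an attention-based fusion module.
    embeddings: array of shape (n_modalities, dim)
    present:    boolean array of shape (n_modalities,)
    """
    present = np.asarray(present, dtype=bool)
    if not present.any():
        raise ValueError("at least one modality must be present")
    return embeddings[present].mean(axis=0)


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def masked_similarity_loss(embeddings, present, mask_ratio=0.5, rng=None):
    """Negative cosine similarity between masked and unmasked fused views.

    Assumption: masking zeroes random embedding features; a real system
    would more likely mask raw tokens, patches, or table cells.
    """
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(embeddings.shape) >= mask_ratio  # True = feature kept
    masked = embeddings * keep
    z_full = fuse(embeddings, present)    # target view (no masking)
    z_masked = fuse(masked, present)      # corrupted view
    return -cosine_similarity(z_masked, z_full)


# Example: four modalities (language, image, tabular, time-series),
# with the image modality missing for this sample.
rng = np.random.default_rng(0)
sample = rng.normal(size=(4, 16))
present = [True, False, True, True]
loss = masked_similarity_loss(sample, present, mask_ratio=0.5, rng=rng)
```

Because the fused representation is built only from the modalities that are present, a sample with a missing modality still yields a well-defined loss in this sketch, which is the property the abstract highlights. Minimizing the loss pushes the masked view toward the unmasked one, with a lower bound of -1.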