MixMAS: A Framework for Sampling-Based Mixer Architecture Search for Multimodal Fusion and Learning
This work addresses architecture selection for multimodal fusion, a problem relevant to researchers and practitioners in multimodal machine learning. The contribution appears incremental, since it builds on existing architecture search methods.
The paper introduces MixMAS, a sampling-based framework that automatically identifies the best MLP-based architecture for a multimodal learning task. It achieves improved performance by systematically exploring combinations of modality-specific encoders, fusion functions, and fusion networks.
Choosing a suitable deep learning architecture for multimodal data fusion is challenging, as it requires effectively integrating and processing diverse data types, each with distinct structures and characteristics. In this paper, we introduce MixMAS, a novel framework for sampling-based mixer architecture search tailored to multimodal learning. Our approach automatically selects the optimal MLP-based architecture for a given multimodal machine learning (MML) task. Specifically, MixMAS uses a sampling-based micro-benchmarking strategy to explore combinations of modality-specific encoders, fusion functions, and fusion networks, systematically identifying the architecture that performs best on the task's performance metrics.
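The search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes an exhaustive sweep over a small candidate grid, scored on a random sample of the data. The names `sample_subset`, `micro_benchmark_search`, and `score_fn` are hypothetical, and `score_fn` stands in for whatever short training-and-validation run produces the task metric.

```python
import itertools
import random
from typing import Any, Callable, List, Sequence, Tuple

Combo = Tuple[str, str, str]  # (encoder, fusion function, fusion network)


def sample_subset(dataset: Sequence[Any], fraction: float, seed: int = 0) -> List[Any]:
    """Draw a random subset of the data to keep each benchmark run cheap."""
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(list(dataset), k)


def micro_benchmark_search(
    encoders: Sequence[str],
    fusion_functions: Sequence[str],
    fusion_networks: Sequence[str],
    score_fn: Callable[[Combo, List[Any]], float],
    dataset: Sequence[Any],
    fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[Combo, float, List[Tuple[float, Combo]]]:
    """Score every candidate combination on a sampled subset and return the best.

    Assumption: higher scores are better, and score_fn (hypothetical) trains
    and evaluates the candidate on the subset it is given.
    """
    subset = sample_subset(dataset, fraction, seed)
    results: List[Tuple[float, Combo]] = []
    for combo in itertools.product(encoders, fusion_functions, fusion_networks):
        results.append((score_fn(combo, subset), combo))
    best_score, best_combo = max(results, key=lambda r: r[0])
    return best_combo, best_score, results
```

To use the sketch, pass candidate names for each of the three components together with a `score_fn` that builds, briefly trains, and evaluates the corresponding model. An exhaustive grid is only practical for small candidate sets. A staged search that fixes one component at a time would need fewer runs, at the cost of possibly missing interactions between components.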