On Stronger Computational Separations Between Multimodal and Unimodal Machine Learning
This work addresses the theoretical foundations of multimodal vs. unimodal learning for researchers in machine learning theory, but it is incremental as it builds on prior separation results.
The paper tackles the problem of theoretically justifying the empirical success of multimodal machine learning by proving a stronger average-case computational separation where unimodal learning is hard but multimodal learning is easy for typical instances, and shows that such separations imply cryptographic protocols, suggesting they may be infrequent in practice.
Recently, multimodal machine learning has enjoyed huge empirical success (e.g. GPT-4). Motivated to develop theoretical justification for this empirical success, Lu (NeurIPS '23, ALT '24) introduces a theory of multimodal learning, and considers possible \textit{separations} between theoretical models of multimodal and unimodal learning. In particular, Lu (ALT '24) shows a computational separation, which is relevant to \textit{worst-case} instances of the learning task. In this paper, we give a stronger \textit{average-case} computational separation, where for ``typical'' instances of the learning task, unimodal learning is computationally hard, but multimodal learning is easy. We then question how ``natural'' the average-case separation is. Would it be encountered in practice? To this end, we prove that under basic conditions, any given computational separation between average-case unimodal and multimodal learning tasks implies a corresponding cryptographic key agreement protocol. We suggest to interpret this as evidence that very strong \textit{computational} advantages of multimodal learning may arise \textit{infrequently} in practice, since they exist only for the ``pathological'' case of inherently cryptographic distributions. However, this does not apply to possible (super-polynomial) \textit{statistical} advantages.