How Intermodal Interaction Affects the Performance of Deep Multimodal Fusion for Mixed-Type Time Series
This work addresses a data type that is common in domains like healthcare and finance, but its contribution is incremental: it provides a comprehensive evaluation of existing fusion approaches rather than introducing a new method.
The paper tackles the problem of effectively fusing mixed-type time series (MTTS) for forecasting by evaluating deep multimodal fusion approaches. It finds that performance is substantially influenced by the direction and strength of intermodal interactions, with early fusion excelling at capturing fine-grained cross-modal features and intermediate fusion excelling at coarse-grained ones.
Mixed-type time series (MTTS) is a bimodal data type that is common in many domains, such as healthcare, finance, environmental monitoring, and social media. It consists of regularly sampled continuous time series and irregularly sampled categorical event sequences. The integration of both modalities through multimodal fusion is a promising approach for processing MTTS. However, the question of how to effectively fuse both modalities remains open. In this paper, we present a comprehensive evaluation of several deep multimodal fusion approaches for MTTS forecasting. Our comparison includes three fusion types (early, intermediate, and late) and five fusion methods (concatenation, weighted mean, weighted mean with correlation, gating, and feature sharing). We evaluate these fusion approaches on three distinct datasets, one of which was generated using a novel framework. This framework allows for the control of key data properties, such as the strength and direction of intermodal interactions, modality imbalance, and the degree of randomness in each modality, providing a more controlled environment for testing fusion approaches. Our findings show that the performance of different fusion approaches can be substantially influenced by the direction and strength of intermodal interactions. The study reveals that early and intermediate fusion approaches excel at capturing fine-grained and coarse-grained cross-modal features, respectively. These findings underscore the crucial role of intermodal interactions in determining the most effective fusion strategy for MTTS forecasting.
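To make three of the fusion methods named above concrete, the following is a minimal sketch of concatenation, weighted mean, and gating applied to one embedding per modality. The abstract does not give the paper's exact formulations, so these are the generic textbook forms. The function names, the fixed scalar weight, and the gate parameters `gate_weights` and `gate_bias` are illustrative assumptions; in a trained model the weights and gate parameters would be learned.

```python
import math

# Illustrative sketch only: generic forms of three fusion operators, not the
# paper's implementation. h_ts is an embedding of the continuous time series,
# h_ev an embedding of the categorical event sequence (both plain lists).

def fuse_concat(h_ts, h_ev):
    """Concatenation: join both embeddings along the feature axis."""
    return list(h_ts) + list(h_ev)

def fuse_weighted_mean(h_ts, h_ev, w_ts=0.5):
    """Weighted mean: convex combination with one scalar weight per modality.

    w_ts is fixed here for clarity (assumption); it would normally be learned.
    """
    return [w_ts * a + (1.0 - w_ts) * b for a, b in zip(h_ts, h_ev)]

def fuse_gating(h_ts, h_ev, gate_weights, gate_bias):
    """Gating: a sigmoid gate computed from both embeddings mixes them per feature.

    gate_weights has one row per output feature, each row of length
    len(h_ts) + len(h_ev); gate_bias has one entry per output feature.
    """
    joint = list(h_ts) + list(h_ev)
    fused = []
    for a, b, w_row, b_k in zip(h_ts, h_ev, gate_weights, gate_bias):
        z = sum(w * x for w, x in zip(w_row, joint)) + b_k
        g = 1.0 / (1.0 + math.exp(-z))   # gate value in (0, 1)
        fused.append(g * a + (1.0 - g) * b)
    return fused

# Toy embeddings with two features per modality.
h_ts = [1.0, 2.0]
h_ev = [3.0, 0.0]

# Zero gate parameters give a gate of 0.5 everywhere, i.e. a plain average.
gate_weights = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
gate_bias = [0.0, 0.0]

concat = fuse_concat(h_ts, h_ev)
mean = fuse_weighted_mean(h_ts, h_ev, w_ts=0.75)
gated = fuse_gating(h_ts, h_ev, gate_weights, gate_bias)
```

The fusion type is independent of the operator: under these assumptions, the same operators could be applied to aligned raw inputs (early fusion), to hidden representations as in this sketch (intermediate fusion), or to per-modality forecasts (late fusion).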