Sparse Multi-Modal Transformer with Masking for Alzheimer's Disease Classification
This work addresses efficiency and robustness issues for multi-modal intelligent systems under resource constraints, with an incremental improvement in a domain-specific application.
The paper tackles the high computational and energy costs of transformer-based multi-modal systems by proposing SMMT, a sparse multi-modal transformer architecture. On Alzheimer's Disease classification with the ADNI dataset, SMMT maintains competitive predictive performance while significantly reducing training time, memory usage, and energy consumption.
Transformer-based multi-modal intelligent systems often suffer from high computational and energy costs due to dense self-attention, which limits their scalability under resource constraints. This paper presents SMMT, a sparse multi-modal transformer architecture designed to improve efficiency and robustness. Building upon a cascaded multi-modal transformer framework, SMMT introduces cluster-based sparse attention to achieve near-linear computational complexity and modality-wise masking to enhance robustness against incomplete inputs. The architecture is evaluated on Alzheimer's Disease classification with the ADNI dataset as a representative multi-modal case study. Experimental results show that SMMT maintains competitive predictive performance while significantly reducing training time, memory usage, and energy consumption compared to dense attention baselines, demonstrating its suitability as a resource-aware architectural component for scalable intelligent systems.
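The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the SMMT implementation: the abstract does not say how clusters are formed or how masking is scheduled, so the cluster assignment is taken as a given input, tokens serve as queries, keys, and values at once, and the function names and the modality names (mri, pet, clinical) are hypothetical. The sketch only shows why restricting attention to clusters of bounded size makes the number of score evaluations grow roughly linearly with sequence length, and what zeroing out whole modalities looks like.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cluster_sparse_attention(tokens, cluster_ids):
    """Self-attention restricted to tokens that share a cluster id.

    tokens: list of equal-length float vectors (used as Q, K and V here,
        an assumption made to keep the sketch short).
    cluster_ids: one integer per token; a hypothetical precomputed assignment.
    Returns (outputs, number of query-key scores evaluated).
    """
    groups = {}
    for idx, cid in enumerate(cluster_ids):
        groups.setdefault(cid, []).append(idx)

    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = [None] * len(tokens)
    n_scores = 0
    for members in groups.values():
        for i in members:
            # Each query only scores the keys inside its own cluster.
            scores = [dot(tokens[i], tokens[j]) * scale for j in members]
            n_scores += len(members)
            weights = softmax(scores)
            outputs[i] = [
                sum(w * tokens[j][d] for w, j in zip(weights, members))
                for d in range(dim)
            ]
    return outputs, n_scores


def mask_modalities(sample, drop_prob, rng):
    """Zero out whole modalities at random, always keeping at least one."""
    names = list(sample)
    kept = [name for name in names if rng.random() >= drop_prob]
    if not kept:
        kept = [rng.choice(names)]
    return {
        name: (list(vec) if name in kept else [0.0] * len(vec))
        for name, vec in sample.items()
    }


if __name__ == "__main__":
    tokens = [[float(i), 1.0] for i in range(8)]
    _, sparse_cost = cluster_sparse_attention(tokens, [0, 0, 1, 1, 2, 2, 3, 3])
    _, dense_cost = cluster_sparse_attention(tokens, [0] * 8)
    print(sparse_cost, dense_cost)  # 16 scores with four clusters of two, 64 when dense

    sample = {"mri": [1.0, 2.0], "pet": [3.0], "clinical": [4.0, 5.0]}
    print(mask_modalities(sample, 0.5, random.Random(0)))
```

With n tokens split into clusters of at most c tokens, the sketch evaluates at most n·c scores instead of n², which is linear in n when c is fixed. Training on samples passed through a masking step like mask_modalities is one common way to expose a model to incomplete inputs; whether SMMT masks at the input, embedding, or attention level is not stated in the abstract.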