Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review
This is an incremental review that synthesizes existing research to guide practitioners and researchers in multimodal educational technology.
This paper addresses the lack of a comprehensive review of empirical methods in applied multimodal learning and training environments by introducing a taxonomy and framework. It finds that integrating modalities yields richer insights into learner behaviors, but that challenges in data collection and integration still hinder real-time classroom use.
Recent technological advancements in multimodal machine learning, including the rise of large language models (LLMs), have improved our ability to collect, process, and analyze diverse multimodal data such as speech, video, and eye gaze in learning and training contexts. While prior reviews have addressed individual components of the multimodal pipeline (e.g., conceptual models, data fusion), a comprehensive review of empirical methods in applied multimodal environments remains notably absent. This review addresses that gap by introducing a taxonomy and framework that capture both established practices and recent innovations driven by LLMs and generative AI. We identify five modality groups: Natural Language, Vision, Physiological Signals, Human-Centered Evidence, and Environment Logs. Our analysis shows that integrating modalities enables richer insights into learner and trainee behaviors, surfacing latent patterns often overlooked by unimodal approaches. However, persistent challenges in multimodal data collection and integration continue to hinder the adoption of these systems in real-time classroom settings.