DML-RAM: Deep Multimodal Learning Framework for Robotic Arm Manipulation using Pre-trained Models
This work addresses robotic control for adaptive systems, but its contribution is incremental: it combines pre-trained models with existing machine learning algorithms.
The paper tackles robotic arm manipulation by proposing a deep multimodal learning framework that integrates image sequences and robot state data through a late-fusion strategy, achieving MSEs of 0.0021 and 0.0028 on the BridgeData V2 and Kuka datasets, respectively.
This paper presents a novel deep learning framework for robotic arm manipulation that integrates multimodal inputs using a late-fusion strategy. Unlike traditional end-to-end or reinforcement learning approaches, our method processes image sequences with pre-trained models and robot state data with machine learning algorithms, then fuses their outputs to predict continuous action values for control. Evaluated on the BridgeData V2 and Kuka datasets, the best configuration (VGG16 + Random Forest) achieved mean squared errors (MSEs) of 0.0021 and 0.0028, respectively, demonstrating strong predictive performance and robustness. The framework supports modularity, interpretability, and real-time decision-making, aligning with the goals of adaptive, human-in-the-loop cyber-physical systems.
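The late-fusion step and the MSE metric described above can be illustrated with a minimal sketch. The abstract does not specify how the two branch outputs are fused, so the weighted average used here is an assumption made purely for illustration, not the paper's method. The names `late_fusion`, `mse`, and `image_weight` are hypothetical, and the two input vectors stand in for predictions that an image branch (for example, a regressor on VGG16 features) and a state branch (for example, a Random Forest) might produce.

```python
from typing import List, Sequence


def late_fusion(
    image_pred: Sequence[float],
    state_pred: Sequence[float],
    image_weight: float = 0.5,
) -> List[float]:
    """Fuse two per-branch action predictions into one action vector.

    ASSUMPTION: the fusion rule is a convex weighted average. The source
    only says the branch outputs are fused; it does not name the rule.
    """
    if len(image_pred) != len(state_pred):
        raise ValueError("branch predictions must share the action dimension")
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must lie in [0, 1]")
    state_weight = 1.0 - image_weight
    return [
        image_weight * a + state_weight * b
        for a, b in zip(image_pred, state_pred)
    ]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a predicted and a target action vector."""
    if len(pred) != len(target) or len(target) == 0:
        raise ValueError("pred and target must be non-empty and equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Hypothetical branch outputs for a 3-dimensional continuous action.
image_branch_pred = [0.10, -0.20, 0.30]   # stand-in for the image branch
state_branch_pred = [0.14, -0.16, 0.26]   # stand-in for the state branch
target_action = [0.12, -0.18, 0.28]       # made-up ground-truth action

fused_action = late_fusion(image_branch_pred, state_branch_pred)
fused_error = mse(fused_action, target_action)
image_only_error = mse(image_branch_pred, target_action)
```

Because each branch only has to emit an action-sized vector, either one can be swapped for a different model without touching the other, which is the modularity the abstract refers to. The numbers above are invented for the example and are unrelated to the reported results.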