A Low-Cost Real-Time Framework for Industrial Action Recognition Using Foundation Models
This work addresses action recognition challenges in industrial environments, offering a more efficient and scalable solution, though it appears incremental as it combines existing methods like Grounding DINO, BLIP-2, YOLOv5, and ViT.
The paper tackles high deployment costs, poor generalization, and limited real-time performance in industrial action recognition by proposing a low-cost real-time framework built on foundation models; in real-world experiments, it reports improved recognition accuracy, scenario generalization, and deployment efficiency.
Action recognition (AR) in industrial environments -- particularly for identifying actions and operational gestures -- faces persistent challenges due to high deployment costs, poor cross-scenario generalization, and limited real-time performance. To address these issues, we propose a low-cost real-time framework for industrial action recognition using foundation models, denoted as LRIAR, to enhance recognition accuracy and transferability while minimizing human annotation and computational overhead. The proposed framework constructs an automatically labeled dataset by coupling Grounding DINO with the pretrained BLIP-2 image encoder, enabling efficient and scalable action labeling. Leveraging the constructed dataset, we train YOLOv5 for real-time action detection and develop a Vision Transformer (ViT) classifier via LoRA-based fine-tuning for action classification. Extensive experiments conducted in real-world industrial settings validate the effectiveness of LRIAR, demonstrating consistent improvements over state-of-the-art methods in recognition accuracy, scenario generalization, and deployment efficiency.
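To make the LoRA-based fine-tuning step concrete, the following is a minimal sketch of the general LoRA idea in pure Python: a frozen weight matrix W receives a trainable low-rank update scaled by alpha / r. It is not the paper's implementation. The class name `LoRALinear`, the helper `lora_param_counts`, and the rank, alpha, and layer sizes are all hypothetical choices made for illustration; the abstract does not state which ViT variant or LoRA settings LRIAR uses.

```python
# Minimal LoRA sketch in pure Python (no deep learning framework).
# Assumptions not taken from the paper: rank, alpha, and all layer sizes.
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, W, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                    # pretrained weight, kept frozen
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A gets a small random init and B starts at zero, so the
        # low-rank update contributes nothing before training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)                    # W x
        delta = matvec(self.B, matvec(self.A, x))   # B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        # Only A and B are trained; W is excluded.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def lora_param_counts(d_in, d_out, rank):
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    return d_in * d_out, rank * (d_in + d_out)


if __name__ == "__main__":
    # Illustrative sizes: a 768 x 768 projection (the hidden width of
    # ViT-Base, assumed here) with a hypothetical rank of 8.
    full, lora = lora_param_counts(768, 768, 8)
    print(full, lora)
```

Under these assumed sizes, a single 768 x 768 projection has 589,824 weights, while a rank-8 LoRA update trains 12,288, about 2% of the full matrix. This kind of reduction in trainable parameters is consistent with the framework's stated goal of low computational overhead.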