Efficient Egocentric Action Recognition with Multimodal Data
This work addresses the efficiency of real-time action recognition on resource-constrained XR devices, offering an incremental improvement through optimized multimodal input strategies.
The paper tackles the challenge of deploying real-time Egocentric Action Recognition (EAR) on wearable XR devices by analyzing the trade-offs between accuracy and computational efficiency across two input modalities, RGB video and 3D hand pose. It finds that reducing the RGB frame sampling rate while using higher-frequency hand pose input can achieve up to a 3x reduction in CPU usage with minimal to no loss in recognition performance.
The increasing availability of wearable XR devices opens new perspectives for Egocentric Action Recognition (EAR) systems, which can provide deeper human understanding and situation awareness. However, deploying real-time algorithms on these devices can be challenging due to the inherent trade-offs between portability, battery life, and computational resources. In this work, we systematically analyze the impact of sampling frequency across different input modalities (RGB video and 3D hand pose) on egocentric action recognition performance and CPU usage. By exploring a range of configurations, we provide a comprehensive characterization of the trade-offs between accuracy and computational efficiency. Our findings reveal that reducing the sampling rate of RGB frames, when complemented with higher-frequency 3D hand pose input, can preserve high accuracy while significantly lowering CPU demands. Notably, we observe up to a 3x reduction in CPU usage with minimal to no loss in recognition performance. This highlights the potential of multimodal input strategies as a viable approach to achieving efficient, real-time EAR on XR devices.
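The idea of sampling each modality at its own rate can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper `sample_indices`, the 30 Hz capture rate, the 2-second clip length, and the 5 Hz / 30 Hz targets are all hypothetical values chosen for illustration, and the abstract does not specify the rates that were actually evaluated.

```python
import math


def sample_indices(num_frames: int, source_hz: float, target_hz: float) -> list[int]:
    """Return indices of the frames kept when a stream captured at
    source_hz is subsampled to roughly target_hz."""
    if source_hz <= 0 or target_hz <= 0:
        raise ValueError("sampling rates must be positive")
    # Never upsample: a step below 1 would repeat frames.
    step = max(source_hz / target_hz, 1.0)
    count = math.ceil(num_frames / step)
    return [int(i * step) for i in range(count)]


# Hypothetical configuration (illustrative values, not taken from the paper):
# a 2-second clip captured at 30 Hz, RGB kept at 5 Hz, hand pose kept at 30 Hz.
CLIP_FRAMES = 60
rgb_idx = sample_indices(CLIP_FRAMES, source_hz=30.0, target_hz=5.0)
pose_idx = sample_indices(CLIP_FRAMES, source_hz=30.0, target_hz=30.0)

print(len(rgb_idx), len(pose_idx))
```

Under these assumed rates the clip keeps 10 RGB frames against 60 hand-pose samples, so the image branch, which is typically the more expensive one, processes a sixth of the frames while the lightweight pose stream retains full temporal resolution. The resulting accuracy and CPU usage depend on the model and device and would have to be measured for each configuration.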