CV · Jan 11, 2020

Towards Generalizable Surgical Activity Recognition Using Spatial Temporal Graph Convolutional Networks

arXiv:2001.03728v48.520 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of generalizability across tasks and datasets in surgical activity recognition, which is important for improving automated surgical systems, though it appears incremental as it builds on existing graph convolutional network methods.

The paper tackles the problem of generalizable surgical activity recognition by introducing a spatial temporal graph representation of surgical tools, achieving 68% average accuracy on the Suturing task of the JIGSAWS dataset, significantly above the 10% chance baseline.

Modeling and recognition of surgical activities pose an interesting research problem. Although a number of recent works have studied automatic recognition of surgical activities, the generalizability of these works across different tasks and different datasets remains a challenge. We introduce a modality that is robust to scene variation and that is able to infer part information such as orientational and relative spatial relationships. The proposed modality is based on spatial temporal graph representations of surgical tools in videos, for surgical activity recognition. To explore its effectiveness, we model and recognize surgical gestures with the proposed modality. We construct spatial graphs connecting the joint pose estimations of surgical tools. Then, we connect each joint to the corresponding joint in consecutive frames, forming inter-frame edges that represent the trajectory of the joint over time. We then learn hierarchical spatial temporal graph representations using Spatial Temporal Graph Convolutional Networks (ST-GCN). Our experiments show that learned spatial temporal graph representations perform well in surgical gesture recognition even when used individually. We experiment with the Suturing task of the JIGSAWS dataset, where the chance baseline for gesture recognition is 10%. Our results demonstrate 68% average accuracy, which suggests a significant improvement. Learned hierarchical spatial temporal graph representations can be used individually, in cascades, or as a complementary modality in surgical activity recognition, and therefore provide a benchmark for future studies. To our knowledge, our paper is the first to use spatial temporal graph representations of surgical tools, and pose-based skeleton representations in general, for surgical activity recognition.
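The graph construction described in the abstract (spatial edges between tool joints within a frame, plus inter-frame edges linking each joint to itself in the next frame) can be sketched as follows. This is only an illustration of the graph structure, not the authors' implementation and not ST-GCN itself. The five-joint tool skeleton, the edge list, the frame count, and the function names `build_st_edges` and `adjacency` are all hypothetical choices made for this example.

```python
import numpy as np


def build_st_edges(num_frames, num_joints, spatial_edges):
    """Build an undirected edge list for a spatial temporal graph.

    Node (t, j), joint j in frame t, is indexed as t * num_joints + j.
    """
    edges = []
    for t in range(num_frames):
        base = t * num_joints
        # Intra-frame (spatial) edges: connect joints of the tool skeleton.
        for a, b in spatial_edges:
            edges.append((base + a, base + b))
        # Inter-frame (temporal) edges: connect each joint to the same
        # joint in the next frame, tracing its trajectory over time.
        if t + 1 < num_frames:
            for j in range(num_joints):
                edges.append((base + j, base + num_joints + j))
    return edges


def adjacency(num_nodes, edges):
    """Return a symmetric 0/1 adjacency matrix for the given edge list."""
    A = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    for a, b in edges:
        A[a, b] = 1.0
        A[b, a] = 1.0
    return A


# Hypothetical skeleton (an assumption, not taken from the paper):
# five joints per tool connected by four spatial edges.
NUM_JOINTS = 5
SPATIAL_EDGES = [(0, 1), (1, 2), (2, 3), (2, 4)]
NUM_FRAMES = 3

edges = build_st_edges(NUM_FRAMES, NUM_JOINTS, SPATIAL_EDGES)
A = adjacency(NUM_FRAMES * NUM_JOINTS, edges)
print(len(edges), A.shape)
```

In practice, ST-GCN-style models typically keep only the per-frame spatial adjacency and handle the time axis with temporal convolutions rather than materializing the full matrix; the dense matrix here is used only to make the two edge types explicit.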

