CV · May 5, 2022

An Empirical Study on Activity Recognition in Long Surgical Videos

Zhuohong He, Ali Mottaghi, Aidean Sharghi, Muhammad Abdullah Jamal, Omid Mohareri

arXiv:2205.02805v312.218 citations · h-index: 16

Originality: Synthesis-oriented

AI Analysis

This work addresses activity recognition for surgical workflow monitoring, but its contribution is incremental: it focuses on empirical benchmarking of existing methods rather than proposing a new one.

The paper benchmarks state-of-the-art deep learning architectures for activity recognition in surgical videos, finds that a Swin-Transformer backbone with a BiGRU temporal model performs strongly on the Cholec80 and Cataract-101 datasets, and explores adaptability to new domains through fine-tuning and unsupervised domain adaptation.

Activity recognition in surgical videos is a key research area for developing next-generation devices and workflow monitoring systems. Since surgeries are long processes with highly variable lengths, deep learning models used for surgical videos often consist of a two-stage setup using a backbone and a temporal sequence model. In this paper, we investigate many state-of-the-art backbones and temporal models to find architectures that yield the strongest performance for surgical activity recognition. We first benchmark the models' performance on a large-scale activity recognition dataset containing over 800 surgery videos captured in multiple clinical operating rooms. We further evaluate the models on two smaller public datasets, Cholec80 and Cataract-101, containing only 80 and 101 videos respectively. We empirically found that the Swin-Transformer+BiGRU temporal model yielded strong performance on both datasets. Finally, we investigate the adaptability of the model to new domains by fine-tuning models to a new hospital and experimenting with a recent unsupervised domain adaptation approach.
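The two-stage setup described in the abstract (a backbone that produces per-frame features, followed by a temporal sequence model) can be illustrated with a minimal sketch of the second stage only. This is not the authors' implementation. It assumes the backbone features have already been extracted, replaces them with random vectors, uses a small untrained bidirectional GRU without bias terms, and picks hypothetical sizes (`feature_dim`, `hidden_dim`, `num_phases`) purely for illustration.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """A bias-free GRU cell with update (z), reset (r), and candidate (n) gates."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.W = {g: rand_matrix(hidden_dim, input_dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden_dim, hidden_dim, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], reset_h))]
        # Blend the previous state and the candidate state with the update gate.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def run(cell, frames):
    """Run a cell over a sequence of any length, returning one state per frame."""
    h = [0.0] * cell.hidden_dim
    states = []
    for x in frames:
        h = cell.step(x, h)
        states.append(h)
    return states

def bigru_predict(features, fwd, bwd, W_out):
    """Predict one activity label per frame from forward and backward context."""
    forward = run(fwd, features)
    backward = run(bwd, features[::-1])[::-1]  # re-align with frame order
    labels = []
    for hf, hb in zip(forward, backward):
        logits = matvec(W_out, hf + hb)  # concatenate both directions
        labels.append(max(range(len(logits)), key=logits.__getitem__))
    return labels

rng = random.Random(0)
feature_dim, hidden_dim, num_phases = 8, 4, 7  # hypothetical sizes
# Stand-in for stage one: random vectors instead of real backbone features.
video_features = [[rng.gauss(0, 1) for _ in range(feature_dim)] for _ in range(30)]
fwd = GRUCell(feature_dim, hidden_dim, rng)
bwd = GRUCell(feature_dim, hidden_dim, rng)
W_out = rand_matrix(num_phases, 2 * hidden_dim, rng)
predictions = bigru_predict(video_features, fwd, bwd, W_out)
```

Because the temporal model consumes one feature vector per frame and carries its state forward and backward, the same weights apply to videos of any length, which is one reason this split suits long surgeries with highly variable durations.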
