CV, LG | Jul 20, 2023

GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos

Nisarg A. Shah, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel

arXiv:2307.11081v1 | 7.69 citations | h-index: 81 | Has Code

Originality: Incremental advance

AI Analysis

The paper targets patient safety and intraoperative decision-making for medical professionals, but the advance is incremental: it adapts existing transformer methods to a specific domain.

The paper tackles automated surgical step recognition with a vision transformer-based approach that jointly learns spatio-temporal features and uses a gated-temporal attention mechanism. It reports superior performance to state-of-the-art methods on the Cataract-101 and D99 datasets.

Automated surgical step recognition is an important task that can significantly improve patient safety and decision-making during surgeries. Existing state-of-the-art methods for surgical step recognition either rely on separate, multi-stage modeling of spatial and temporal information or operate on short-range temporal resolution when learned jointly. As a result, the benefits of jointly modeling spatio-temporal features and long-range information are not taken into account. In this paper, we propose a vision transformer-based approach to jointly learn spatio-temporal features directly from a sequence of frame-level patches. Our method incorporates a gated-temporal attention mechanism that combines short-term and long-term spatio-temporal feature representations. We extensively evaluate our approach on two cataract surgery video datasets, namely Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods. These results validate the suitability of our proposed approach for automated surgical step recognition. Our code is released at: https://github.com/nisargshah1999/GLSFormer
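The abstract does not spell out how the gate combines the short-term and long-term representations, so the following is only a minimal sketch of one common gating pattern, not the authors' implementation. It assumes a single scalar sigmoid gate computed from the concatenated features and a convex blend of the two vectors. The function name `gated_fusion`, its parameters, and the feature values are all hypothetical and chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(short_feat, long_feat, gate_weights, gate_bias=0.0):
    """Blend two equal-length feature vectors with one scalar gate.

    ASSUMPTION: this generic sigmoid gate is illustrative only; the
    abstract does not say how GLSFormer computes or applies its gate.
    """
    if len(short_feat) != len(long_feat):
        raise ValueError("short_feat and long_feat must have the same length")
    concat = list(short_feat) + list(long_feat)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    # Gate score from the concatenated features (hypothetical parameterization).
    score = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    g = sigmoid(score)
    # Convex combination: g near 1 favors short-term, g near 0 favors long-term.
    fused = [g * s + (1.0 - g) * l for s, l in zip(short_feat, long_feat)]
    return fused, g


short_term = [1.0, 0.0, 2.0]  # made-up short-window features
long_term = [0.0, 1.0, 4.0]   # made-up long-window features
fused, gate = gated_fusion(short_term, long_term, gate_weights=[0.0] * 6)
print(gate, fused)  # 0.5 [0.5, 0.5, 3.0]
```

With all gate weights set to zero the gate sits at 0.5, so the output is the plain average of the two inputs. In a trained model the weights would be learned, letting the gate shift toward whichever temporal range is more informative for a given clip.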
