DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification
This work addresses the loss of accuracy that camera movement classification models suffer on degraded archival film, and reports modest gains in robustness for film analysis.
The paper tackles camera movement classification in archival film, where noise and low contrast obscure motion cues. It introduces DGME-T, a lightweight extension to the Video Swin Transformer that incorporates directional grid motion encoding. Top-1 accuracy improves from 81.78% to 86.14% on modern clips and from 83.43% to 84.62% on historical footage.
Camera movement classification (CMC) models trained on contemporary, high-quality footage often degrade when applied to archival film, where noise, missing frames, and low contrast obscure motion cues. We bridge this gap by assembling a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. Building on this benchmark, we introduce DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding, derived from optical flow, via a learnable and normalised late-fusion layer. DGME-T raises the backbone's top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81% on modern clips, while still improving the demanding World-War-II footage from 83.43% to 84.62% accuracy and from 81.72% to 82.63% macro F1. A cross-domain study further shows that an intermediate fine-tuning stage on modern data increases historical performance by more than five percentage points. These results demonstrate that structured motion priors and transformer representations are complementary and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis. Related resources are available at https://github.com/linty5/DGME-T.
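To make the idea of a directional grid motion encoding with normalised late fusion more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a dense optical flow field is already available as a nested list of (dx, dy) vectors, splits the frame into a small grid, and builds a magnitude-weighted direction histogram per cell. The function names, the grid size, the number of direction bins, and the fixed scalar `alpha` (which stands in for a weight that would be learned during training) are all illustrative assumptions.

```python
import math


def directional_grid_encoding(flow, grid=2, bins=8):
    """Encode a dense flow field as per-cell direction histograms.

    flow: H x W nested list of (dx, dy) tuples (assumed input format).
    Returns a flat list of grid * grid * bins floats, where each cell's
    histogram is L1-normalised (all zeros if the cell has no motion).
    """
    height, width = len(flow), len(flow[0])
    hist = [[0.0] * bins for _ in range(grid * grid)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            mag = math.hypot(dx, dy)
            if mag == 0.0:
                continue
            angle = math.atan2(dy, dx) % (2 * math.pi)
            # Shift by half a bin so that bin 0 is centred on 0 radians.
            b = int((angle + math.pi / bins) / (2 * math.pi) * bins) % bins
            cell = (y * grid // height) * grid + (x * grid // width)
            hist[cell][b] += mag  # weight each vote by flow magnitude
    feats = []
    for cell_hist in hist:
        total = sum(cell_hist)
        feats.extend(v / total if total > 0 else 0.0 for v in cell_hist)
    return feats


def normalise(vec, eps=1e-6):
    """Zero-mean, unit-variance normalisation of a score vector."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in vec]


def late_fusion(backbone_logits, motion_logits, alpha=0.5):
    """Add normalised motion scores to normalised backbone scores.

    alpha is a fixed stand-in for a learnable fusion weight (assumption).
    """
    b = normalise(backbone_logits)
    m = normalise(motion_logits)
    return [bi + alpha * mi for bi, mi in zip(b, m)]


# Usage: a 4 x 4 flow field in which every pixel moves to the right.
pan_right = [[(1.0, 0.0)] * 4 for _ in range(4)]
features = directional_grid_encoding(pan_right)
```

In a full model, a small head would map such features to class scores before fusion. Normalising both score vectors before adding them keeps either branch from dominating purely because of its scale, which is one plausible reading of a "normalised" fusion layer.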