CV | Sep 24, 2025

CapStARE: Capsule-based Spatiotemporal Architecture for Robust and Efficient Gaze Estimation

arXiv:2509.19936v1 | h-index: 18 | Has Code
Originality: Highly original
AI Analysis

CapStARE provides a robust and efficient solution for real-time gaze estimation in interactive systems such as human-robot interaction.

The paper introduces CapStARE, a capsule-based spatiotemporal architecture for gaze estimation that achieves state-of-the-art angular errors of 3.36° on ETH-XGaze and 2.65° on MPIIFaceGaze while keeping inference real-time (under 10 ms).
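Gaze benchmarks such as these conventionally report error as the mean angle between predicted and ground-truth gaze directions. The sketch below shows how such a figure could be computed, assuming that is the metric behind the numbers above. The function names and the pitch/yaw-to-vector convention are illustrative assumptions, not taken from the CapStARE code.

```python
import math


def pitch_yaw_to_vector(pitch, yaw):
    """Convert gaze angles (radians) to a unit 3D direction.

    Assumption: one commonly used camera-facing convention; the paper's
    own convention may differ.
    """
    return (
        -math.cos(pitch) * math.sin(yaw),
        -math.sin(pitch),
        -math.cos(pitch) * math.cos(yaw),
    )


def angular_error_deg(pred, target):
    """Angle in degrees between two non-zero 3D gaze vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    cos_sim = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_sim))


def mean_angular_error_deg(preds, targets):
    """Average the per-sample angular error over a dataset."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

For example, a prediction whose yaw is off by 10 degrees while pitch matches yields an angular error of about 10 degrees, so a reported 3.36 would correspond to predictions that deviate from the true gaze direction by a little over 3 degrees on average.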

We introduce CapStARE, a capsule-based spatio-temporal architecture for gaze estimation that integrates a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders specialized for slow and rapid gaze dynamics. This modular design enables efficient part-whole reasoning and disentangled temporal modeling, achieving state-of-the-art performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65) while maintaining real-time inference (< 10 ms). The model also generalizes well to unconstrained conditions in Gaze360 (9.06) and human-robot interaction scenarios in RT-GENE (4.76), outperforming or matching existing methods with fewer parameters and greater interpretability. These results demonstrate that CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems. The related code and results for this article can be found at: https://github.com/toukapy/capsStare
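A real-time claim such as "< 10 ms" is usually checked by timing repeated forward passes after a warm-up phase. The following is a minimal, framework-agnostic sketch of such a measurement. The helper name `measure_latency_ms` and its parameters are hypothetical, and the abstract does not describe the authors' actual benchmarking protocol or hardware.

```python
import statistics
import time


def measure_latency_ms(fn, n_warmup=10, n_runs=100):
    """Time a zero-argument callable and return latency statistics in ms.

    `fn` stands in for one inference call, e.g. lambda: model(frame).
    """
    # Warm-up runs absorb one-off costs (caches, lazy initialisation).
    for _ in range(n_warmup):
        fn()

    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


if __name__ == "__main__":
    # Placeholder workload; replace with a real inference call.
    stats = measure_latency_ms(lambda: sum(range(10_000)))
    print(stats)
```

When the model runs on a GPU, kernels are launched asynchronously, so the callable should block until the result is ready (for instance by synchronizing the device) before the timer stops; otherwise the measured latency will be too optimistic.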

