TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation
TETO addresses the sim-to-real gap in event-camera motion estimation for robotics and vision applications, offering a more data-efficient alternative to training on large-scale synthetic data.
The paper tackles event-based motion estimation, which suffers from a sim-to-real gap because existing estimators rely on synthetic data. It proposes TETO, a teacher-student framework that learns from only ~25 minutes of unannotated real-world recordings and achieves state-of-the-art point tracking on EVIMO2 and optical flow on DSEC with orders of magnitude less training data.
Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only $\sim$25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.
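To make the idea of motion-aware query sampling concrete, the following is a minimal sketch and not the authors' method. It assumes a dense displacement field from the RGB teacher is available, approximates the dominant ego-motion by the median flow vector, and samples query points with probability proportional to the residual motion. The function name `sample_motion_aware_queries`, the median-based ego-motion estimate, and the `eps` smoothing term are all hypothetical choices made for illustration; the abstract does not specify how TETO separates object motion from ego-motion.

```python
import numpy as np


def sample_motion_aware_queries(teacher_flow, num_queries, rng=None, eps=1e-6):
    """Sample query pixels biased toward independently moving regions.

    teacher_flow: (H, W, 2) array of per-pixel displacements, assumed to
        come from a pretrained RGB teacher (hypothetical input format).
    Returns an integer array of shape (num_queries, 2) holding (x, y) coordinates.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = teacher_flow.shape

    # Assumption: the dominant (ego) motion is approximated by the median
    # flow vector over the whole frame.
    ego_motion = np.median(teacher_flow.reshape(-1, 2), axis=0)

    # Residual motion magnitude after removing the ego-motion estimate.
    residual = np.linalg.norm(teacher_flow - ego_motion, axis=-1)

    # Sampling weights proportional to residual motion; eps keeps every
    # pixel selectable so sampling without replacement is always valid.
    weights = residual.reshape(-1) + eps
    probs = weights / weights.sum()

    flat_idx = rng.choice(h * w, size=num_queries, replace=False, p=probs)
    ys, xs = np.divmod(flat_idx, w)  # row-major index -> (row, column)
    return np.stack([xs, ys], axis=-1)
```

Under these assumptions, a frame whose background moves uniformly while a small object moves differently yields queries concentrated on the object, which is the kind of disentangling of object motion from ego-motion that the abstract describes.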