CV · May 2

TT4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos

Nima Rahmanian, Daniel Kienzle, Thomas Gossard, Dvij Kalaria, Rainer Lienhart, Shankar Sastry

arXiv:2605.01234 · 17.4 · h-index: 8

Predicted impact: top 44% in CV · last 90 days · Originality: Highly original

AI Analysis

This dataset and pipeline address the bottleneck of reliable 3D reconstruction from monocular sports videos for table tennis, providing a new foundation for downstream tasks.

TT4D introduces a large-scale table tennis dataset with over 140 hours of reconstructed gameplay from monocular videos, enabling virtual replay, player analysis, and robot learning. The proposed lift-first pipeline achieves high-fidelity 3D reconstruction by first lifting the 2D ball track to 3D before time segmentation, overcoming occlusion and viewpoint challenges.

We present TT4D, a large-scale, high-fidelity table tennis dataset. It provides 140+ hours of reconstructed singles and doubles gameplay from monocular broadcast videos, featuring multimodal annotations like high-quality camera calibrations, precise 3D ball positions, ball spin, time segmentation, and 3D human meshes over time. This rich data provides a new foundation for virtual replay, in-depth player analysis, and robot learning. The dataset's combination of scale and precision is achieved through a novel reconstruction pipeline. Prior methods first partition a game sequence into individual shot segments based on the 2D ball track, and only then attempt reconstruction. However, 2D-based time segmentation collapses under occlusion and varied camera viewpoints, preventing reliable reconstruction. We invert this paradigm by first lifting the entire unsegmented 2D ball track to 3D through a learned lifting network. This 3D trajectory then allows us to reliably perform time segmentation. The learned lifting network also infers the ball's spin, handles unreliable ball detections, and successfully reconstructs the ball trajectory in cases of high occlusion. This lift-first design is necessary, as our pipeline is the only method capable of reconstructing table tennis gameplay from general-view broadcast monocular videos. We demonstrate the dataset's fidelity through two downstream tasks: estimating the racket's pose and velocity at impact, and training a generative model of competitive rallies.
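
To illustrate why a 3D trajectory makes time segmentation easier, here is a minimal sketch that splits an already-lifted ball track into shot segments. The abstract does not say how TT4D performs segmentation, so this is an illustrative heuristic rather than the authors' method. It assumes that the first coordinate runs along the table's length and that a hit shows up as a reversal of travel direction along that axis. The function name `segment_shots` and the `min_gap` parameter are hypothetical.

```python
import numpy as np


def segment_shots(positions, min_gap=3):
    """Split a 3D ball trajectory into shot segments.

    positions: (N, 3) sequence of ball positions. Coordinate 0 is ASSUMED
    to run along the table's length (a hypothetical convention).
    min_gap: minimum number of frames between two segment boundaries.
    Returns a list of (start, end) frame index pairs, end exclusive.
    """
    positions = np.asarray(positions, dtype=float)
    # Finite-difference velocity along the table's long axis.
    vx = np.diff(positions[:, 0])
    direction = np.sign(vx)
    # Carry the last known direction through frames with no motion.
    for i in range(1, len(direction)):
        if direction[i] == 0:
            direction[i] = direction[i - 1]
    # ASSUMPTION: a hit occurs wherever the direction of travel flips.
    flips = np.flatnonzero(direction[1:] * direction[:-1] < 0) + 1
    boundaries = [0]
    for f in flips:
        # Ignore flips that follow the previous boundary too closely.
        if f - boundaries[-1] >= min_gap:
            boundaries.append(int(f))
    boundaries.append(len(positions))
    return list(zip(boundaries[:-1], boundaries[1:]))
```

In a 2D image track the same reversal can be hidden by occlusion or foreshortened by the camera viewpoint, which is the failure mode the abstract attributes to 2D-based segmentation. A real pipeline would also need to handle noise, bounces, and serves, which this sketch ignores.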
