RAFT-3D: Scene Flow using Rigid-Motion Embeddings
RAFT-3D improves the accuracy of scene flow estimation, which matters for applications that need accurate 3D motion understanding, such as autonomous driving and robotics.
This paper tackles scene flow: estimating pixelwise 3D motion from a pair of stereo or RGB-D video frames. On FlyingThings3D, under the two-view evaluation, the proposed RAFT-3D model raises the best published accuracy (d < 0.05) from 34.3% to 83.7%. On KITTI, it achieves an error of 5.77, below the previous best of 6.31.
We address the problem of scene flow: given a pair of stereo or RGB-D video frames, estimate pixelwise 3D motion. We introduce RAFT-3D, a new deep architecture for scene flow. RAFT-3D is based on the RAFT model developed for optical flow but iteratively updates a dense field of pixelwise SE3 motion instead of 2D motion. A key innovation of RAFT-3D is rigid-motion embeddings, which represent a soft grouping of pixels into rigid objects. Integral to rigid-motion embeddings is Dense-SE3, a differentiable layer that enforces geometric consistency of the embeddings. Experiments show that RAFT-3D achieves state-of-the-art performance. On FlyingThings3D, under the two-view evaluation, we improve the best published accuracy (d < 0.05) from 34.3% to 83.7%. On KITTI, we achieve an error of 5.77, outperforming the best published method (6.31), despite using no object instance supervision. Code is available at https://github.com/princeton-vl/RAFT-3D.
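To make the idea of a dense field of pixelwise SE3 motion concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, a known depth map, and a per-pixel rotation and translation already given. The function names `backproject` and `scene_flow_from_se3` are illustrative only. The sketch shows how such a field induces a 3D displacement (scene flow) at each pixel.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) to 3D points (H, W, 3) with a pinhole model."""
    h, w = depth.shape
    # v indexes rows (image y), u indexes columns (image x)
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def scene_flow_from_se3(points, rotations, translations):
    """Apply a per-pixel rigid motion and return the 3D displacement.

    points:       (H, W, 3)    3D point at each pixel
    rotations:    (H, W, 3, 3) rotation matrix at each pixel
    translations: (H, W, 3)    translation at each pixel
    """
    moved = np.einsum("hwij,hwj->hwi", rotations, points) + translations
    return moved - points

# Toy scene: a flat surface at depth 2 observed in a 2x3 image.
depth = np.full((2, 3), 2.0)
points = backproject(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)

# Identity motion everywhere, except the rightmost column, which
# shares one rigid motion (a 0.1 shift along x), like a single rigid object.
rotations = np.broadcast_to(np.eye(3), (2, 3, 3, 3)).copy()
translations = np.zeros((2, 3, 3))
translations[:, 2:] = [0.1, 0.0, 0.0]

flow = scene_flow_from_se3(points, rotations, translations)
```

In this toy setup, pixels that share the same rigid motion form one group, which is the kind of structure the rigid-motion embeddings are described as capturing softly. Here the grouping is hard-coded rather than learned.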