Shape of Motion reconstructs a 4D scene from a single monocular video.
Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches are limited in that they either depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. In this work, we introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion, from casually captured monocular videos. We tackle the under-constrained nature of the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we utilize a comprehensive set of data-driven priors, including monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
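To make the motion-basis idea concrete, below is a minimal sketch (not the authors' implementation) of expressing one point's motion at a given frame as a soft combination of K SE(3) motion bases. It assumes the bases are stored as rotation quaternions plus translations and that per-point weights are softmax-normalized; the paper's actual parameterization and blending scheme may differ.

```python
# Hypothetical sketch of per-point motion as a weighted combination of
# SE(3) motion bases; names and the blending details are illustrative only.
import numpy as np


def blend_se3(quats, trans, logits):
    """Blend K SE(3) bases (quats: K x 4 in wxyz, trans: K x 3) with per-point logits (K,)."""
    w = np.exp(logits - logits.max())
    w = w / w.sum()                       # softmax so the combination is convex
    # Align quaternion signs to the dominant basis before averaging.
    ref = quats[np.argmax(w)]
    signs = np.where(quats @ ref < 0, -1.0, 1.0)[:, None]
    q = (w[:, None] * signs * quats).sum(axis=0)
    q = q / np.linalg.norm(q)             # renormalize the blended rotation
    t = (w[:, None] * trans).sum(axis=0)  # translations blend linearly
    return q, t


def apply_se3(q, t, x):
    """Rotate point x by unit quaternion q (wxyz), then translate by t."""
    s, v = q[0], q[1:]
    x_rot = x + 2.0 * np.cross(v, np.cross(v, x) + s * x)
    return x_rot + t


# Example: a point softly assigned to 3 rigid groups, transformed to frame t.
K = 3
quats_t = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (K, 1))    # basis rotations at frame t
trans_t = np.array([[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.0]])
logits = np.array([2.0, 0.5, -1.0])                           # per-point basis weights
q_t, t_t = blend_se3(quats_t, trans_t, logits)
print(apply_se3(q_t, t_t, np.array([0.0, 0.0, 1.0])))
```

Because the weights are shared across all frames for a given point, points that move together end up dominated by the same few bases, which yields the soft decomposition into rigidly-moving groups described above.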
For each method, we render the video from a novel viewpoint and overlay its predicted 3D tracks onto the novel views. TAPIR + Depth Anything does not produce novel views, so we instead overlay its tracks onto our renderings.
Fast motion and occlusions are challenging for our method.
Our method relies on off-the-shelf components, e.g., monocular depth estimation, whose predictions can be incorrect.
@article{som2024,
  title   = {Shape of Motion: 4D Reconstruction from a Single Video},
  author  = {Wang, Qianqian and Ye, Vickie and Gao, Hang and Austin, Jake and Li, Zhengqi and Kanazawa, Angjoo},
  journal = {arXiv preprint arXiv:2407.13764},
  year    = {2024}
}