Shape of Motion:
4D Reconstruction from a Single Video

Qianqian Wang*, Vickie Ye*, Hang Gao*, Jake Austin, Zhengqi Li, Angjoo Kanazawa
¹UC Berkeley   ²Google Research
* Equal contribution

Shape of Motion reconstructs a 4D scene from a single monocular video.


Abstract

Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches are limited in that they either depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. In this work, we introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion, from casually captured monocular videos. We tackle the under-constrained nature of the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we utilize a comprehensive set of data-driven priors, including monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
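To make the motion-basis idea concrete, the sketch below blends a small set of SE(3) bases per point, linear-blend-skinning style: each point's transform is a weighted sum of the basis transforms, with soft weights that group points into near-rigid parts. This is a minimal NumPy illustration under our own naming; directly averaging the 4x4 matrices is one simple way to realize the linear combination, not necessarily the paper's exact parameterization.

import numpy as np

def blend_se3_bases(bases, weights):
    """Per-point transforms as a weighted sum of SE(3) motion bases.

    bases:   (B, 4, 4) rigid transforms at one time step.
    weights: (N, B) per-point blending weights (rows sum to 1).
    Returns: (N, 4, 4) per-point transforms.
    """
    return np.einsum("nb,bij->nij", weights, bases)

def transform_points(T, pts):
    """Apply per-point 4x4 transforms T (N, 4, 4) to 3D points pts (N, 3)."""
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)  # homogeneous coords
    return np.einsum("nij,nj->ni", T, pts_h)[:, :3]

# Toy example: two bases, three points with soft basis assignments.
rng = np.random.default_rng(0)
bases = np.stack([np.eye(4), np.eye(4)])
bases[1, :3, 3] = [0.0, 1.0, 0.0]            # second basis translates along y
weights = rng.dirichlet(np.ones(2), size=3)  # soft per-point weights
pts = rng.normal(size=(3, 3))
print(transform_points(blend_se3_bases(bases, weights), pts))

Because each basis is a full rigid transform, points with similar weights move together, which is what yields the soft decomposition of the scene into rigidly-moving groups.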



More Results

[Five example videos: for each, the input video, our recovered 3D tracks, and a rendered novel view.]




3D Tracking Comparison

For each method, we render the video from a novel viewpoint and overlay its predicted 3D tracks onto the novel views. TAPIR + Depth Anything does not produce novel views, so we overlay its tracks onto our renderings instead.

[Three example comparisons showing HyperNeRF, Deformable-3D-GS, TAPIR + Depth Anything, and Ours.]
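The overlay step above amounts to a pinhole projection of the 3D tracks into the novel camera. A minimal sketch, assuming intrinsics K and a world-to-camera extrinsic w2c for the novel view (names are ours):

import numpy as np

def project_tracks(tracks_3d, K, w2c):
    """Project 3D tracks into a novel camera for overlay.

    tracks_3d: (T, N, 3) world-space track positions over T frames.
    K:         (3, 3) pinhole intrinsics.
    w2c:       (4, 4) world-to-camera transform of the novel view.
    Returns:   (T, N, 2) pixel coordinates (points assumed in front of the camera).
    """
    ones = np.ones(tracks_3d.shape[:-1] + (1,))
    cam = np.concatenate([tracks_3d, ones], axis=-1) @ w2c.T  # world -> camera
    uvw = cam[..., :3] @ K.T                                  # camera -> image plane
    return uvw[..., :2] / uvw[..., 2:3]                       # perspective divide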




Novel View Synthesis Comparison

[Two example comparisons showing HyperNeRF, Deformable-3D-GS, and Ours.]




2D Tracking Comparison

[Two example comparisons showing TAPIR and Ours.]



Failure Cases


Fast motion and occlusions are challenging for our method.


Our method relies on off-the-shelf predictions, e.g., monocular depth estimates, which can be incorrect.


Acknowledgements

We thank Ruilong Li, Noah Snavely, Brent Yi, and Aleksander Holynski for helpful discussions. This page is dedicated to the memory of our beloved cat Sriracha, who will always be missed and loved. This project is supported in part by DARPA No. HR001123C0021 and IARPA DOI/IBC No. 140D0423C0035. The views and conclusions contained herein are those of the authors and do not represent the official policies or endorsements of these institutions.

BibTeX


@article{som2024,
  title   = {Shape of Motion: 4D Reconstruction from a Single Video},
  author  = {Wang, Qianqian and Ye, Vickie and Gao, Hang and Austin, Jake and Li, Zhengqi and Kanazawa, Angjoo},
  journal = {arXiv preprint arXiv:2407.13764},
  year    = {2024}
}