\begin{abstract}
    This paper addresses the challenge of quickly reconstructing free-viewpoint videos of dynamic humans from sparse multi-view videos.
    Some recent works represent a dynamic human as a canonical neural radiance field (NeRF) and a motion field, both learned from videos through differentiable rendering, but their per-scene optimization generally requires hours.
    Other, generalizable NeRF models leverage priors learned from datasets and reduce optimization time by only fine-tuning on new scenes, at the cost of visual fidelity.
    In this paper, we propose a novel method for learning neural volumetric videos of dynamic humans from sparse-view videos in minutes with competitive visual quality.
    Specifically, we define a novel part-based voxelized human representation to better distribute the representational power of the network across different human parts.
    Furthermore, we propose a novel 2D motion parameterization scheme to increase the convergence rate of deformation field learning.
    Experiments demonstrate that our model can be learned 100 times faster than prior per-scene optimization methods while remaining competitive in rendering quality. Training our model on a $512 \times 512$ video with 100 frames typically takes about 5 minutes on a single RTX 3090 GPU. The code will be released on our project page: \href{https://zju3dv.github.io/instant_nvr}{https://zju3dv.github.io/instant\_nvr}.
\end{abstract}