abstract:93d1647eb6b8870c.tex

1: \begin{abstract}

2:

3: All existing 3D-from-2D generators are designed for well-curated single-category datasets, where all the objects have (approximately) the same scale, 3D location and orientation, and the camera always points to the center of the scene.

4: This makes them inapplicable to diverse, in-the-wild datasets of non-alignable scenes rendered from arbitrary camera poses.

5: In this work, we develop \textit{\modelfullname\ (\modelname)}: a 3D synthesis framework with more general assumptions about the training data, and show that it scales to very challenging datasets such as ImageNet.

6: Our model is based on three new ideas.

7: First, we incorporate an \textit{inaccurate} off-the-shelf depth estimator into 3D GAN training via a special depth adaptation module to handle the imprecision.

8: Second, we create a flexible camera model and a regularization strategy for it, so that its distribution parameters can be learned during training.

9: Finally, we extend recent ideas on transferring knowledge from pretrained classifiers into GANs to patch-wise trained models by applying a simple distillation-based technique on top of the discriminator.

10: This technique yields more stable training than existing methods and speeds up convergence by at least 40\%.

11: We evaluate our model on four datasets: SDIP Dogs $256^2$, SDIP Elephants $256^2$, LSUN Horses $256^2$, and ImageNet $256^2$, and demonstrate that 3DGP outperforms the recent state of the art in terms of both texture and geometry quality.

12:

13: \begin{center}

14: Code and visualizations: \projecthref

15: \end{center}

16:

17: \end{abstract}

18: