\begin{abstract}

Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers.

In this work, we analyze camera motion from a first-principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality.

First, we determine that motion induced by camera movements in videos is low-frequency in nature.

This motivates us to adjust the training and test-time pose conditioning schedules, accelerating training convergence while improving visual and motion quality.

Then, by probing the representations of an unconditional video diffusion transformer, we observe that the model implicitly performs camera pose estimation under the hood, and that only a subset of its layers contains camera information.

This suggests limiting the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to a $4\times$ reduction in training parameters, improved training speed, and 10\% higher visual quality.

Finally, we complement the typical dataset for camera control learning with a curated dataset of $20$K diverse dynamic videos with stationary cameras.

This helps the model disambiguate camera motion from scene motion, and improves the dynamics of generated pose-conditioned videos.

We combine these findings to design the \methodfullname~(\methodname) architecture, the new state-of-the-art model for generative video modeling with camera control.

\end{abstract}
