\begin{abstract}
Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers.
In this work, we analyze camera motion from a first-principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality.
First, we determine that motion induced by camera movements in videos is low-frequency in nature.
This motivates us to adjust the training and test-time pose conditioning schedules, accelerating training convergence while improving visual and motion quality.
Then, by probing the representations of an unconditional video diffusion transformer, we observe that it implicitly performs camera pose estimation, and that only a subset of its layers contains the camera information.
This suggests limiting the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to a $4\times$ reduction in training parameters, improved training speed, and 10\% higher visual quality.
Finally, we complement the typical dataset for camera control learning with a curated dataset of $20$K diverse dynamic videos with stationary cameras.
This helps the model disambiguate camera motion from scene motion and improves the dynamics of generated pose-conditioned videos.
We combine these findings to design the \methodfullname~(\methodname) architecture, the new state-of-the-art model for generative video modeling with camera control.
\end{abstract}