1: \begin{abstract}
2: \textit{Nature is infinitely resolution-free}.
3: Yet existing diffusion models, such as Diffusion Transformers, often struggle to process image resolutions outside their training domain.
4: To address this limitation, we conceptualize images as sequences of tokens with dynamic sizes, rather than as fixed-resolution grids, as traditional methods do.
5: This perspective enables a flexible training strategy that seamlessly accommodates various aspect ratios during both training and inference, thus promoting resolution generalization and eliminating biases introduced by image cropping.
6: On this basis, we present the \textbf{Flexible Vision Transformer} (FiT), a transformer architecture specifically designed for generating images with \textit{unrestricted resolutions and aspect ratios}.
7: We further upgrade FiT to FiTv2 with several innovative designs, including Query-Key vector normalization, the AdaLN-LoRA module, a rectified flow scheduler, and a Logit-Normal sampler.
8: Enhanced by a meticulously adjusted network structure, FiTv2 achieves $2\times$ the convergence speed of FiT.
9: When incorporating advanced training-free extrapolation techniques, FiTv2 demonstrates remarkable adaptability in both resolution extrapolation and diverse resolution generation.
10: Additionally, our exploration of the scalability of the FiTv2 model reveals that larger models exhibit better computational efficiency.
11: Furthermore, we introduce an efficient post-training strategy to adapt a pre-trained model for high-resolution generation.
12: Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions.
13: We have released all code and models at \url{https://github.com/whlzy/FiT} to promote the exploration of diffusion transformer models for arbitrary-resolution image generation.
14:
15: \end{abstract}
16: