abstract:644faec6fb25105d.tex

1: \begin{abstract}

2: \textit{Nature is infinitely resolution-free}.

3: In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain.

4: To address this limitation, we conceptualize images as sequences of tokens with dynamic sizes, rather than treating them as fixed-resolution grids as traditional methods do.

5: This perspective enables a flexible training strategy that seamlessly accommodates various aspect ratios during both training and inference, thus promoting resolution generalization and eliminating biases introduced by image cropping.

6: On this basis, we present the \textbf{Flexible Vision Transformer} (FiT), a transformer architecture specifically designed for generating images with \textit{unrestricted resolutions and aspect ratios}.

7: We further upgrade FiT to FiTv2 with several innovative designs, including Query-Key vector normalization, the AdaLN-LoRA module, a rectified flow scheduler, and a Logit-Normal sampler.

8: Enhanced by a meticulously adjusted network structure, FiTv2 achieves $2\times$ the convergence speed of FiT.

9: When incorporating advanced training-free extrapolation techniques, FiTv2 demonstrates remarkable adaptability in both resolution extrapolation and diverse resolution generation.

10: Additionally, our exploration of the scalability of the FiTv2 model reveals that larger models exhibit better computational efficiency.

11: Furthermore, we introduce an efficient post-training strategy to adapt a pre-trained model for high-resolution generation.

12: Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions.

13: We have released all code and models at \url{https://github.com/whlzy/FiT} to promote the exploration of diffusion transformer models for arbitrary-resolution image generation.

14:

15: \end{abstract}

16: