
\begin{abstract}


Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an \textbf{optimization dilemma} in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance.

Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within the tokenizer or failing to converge fully because of the high computational cost.

We argue that this dilemma stems from the inherent difficulty of learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed \textbf{VA-VAE} (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces.

To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed \textbf{LightningDiT}.

The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256$\times$256 generation with an \textbf{FID score of 1.35}, while also demonstrating remarkable training efficiency: it reaches an FID score of 2.11 in just 64 epochs, a more than 21$\times$ convergence speedup over the original DiT.

Models and code are available at \url{https://github.com/hustvl/LightningDiT}.


\end{abstract}
