\begin{abstract}
Diffusion Transformers (DiT) have attracted significant research attention, but they converge slowly during training. In this paper, we aim to accelerate DiT training without any architectural modification.
We identify two issues in the training process: first, certain training strategies do not perform consistently well across different data; second, supervision at specific timesteps has limited effectiveness.
In response, we make the following contributions:
(1) We introduce a new perspective for interpreting why these strategies fail. Specifically, we slightly extend the definition of the Signal-to-Noise Ratio (SNR) and propose examining the Probability Density Function (PDF) of the SNR to understand what determines a strategy's robustness across data.
(2) We conduct extensive experiments, reporting over one hundred experimental results, and empirically derive a unified acceleration strategy from the perspective of the PDF.
(3) We develop a new supervision method that further accelerates DiT training.
Building on these contributions, we propose \textbf{FasterDiT}, an exceedingly simple and practical design strategy. With only a few lines of code modified, it achieves 2.30 FID on ImageNet at 256$\times$256 resolution with 1000k iterations, which is comparable to DiT (2.27 FID) but 7$\times$ faster in training.
\end{abstract}