1: \begin{abstract}
2: \vspace{-2mm}
3: This work presents \ourmethod, a scale-wise transformer for text-to-image (T2I) generation. 
4: Starting from existing next-scale prediction autoregressive (AR) models, we first explore them for T2I generation and propose architectural modifications that improve their convergence and overall performance. 
5: We then argue that scale-wise transformers do not require causality and propose a non-causal counterpart that enables ${\sim}11\%$ faster sampling and lower memory usage while also achieving slightly better generation quality.
6: Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. %
7: By disabling guidance at these scales, we achieve an additional sampling acceleration of ${\sim}20\%$ and improve the generation of fine-grained details. 
8: Extensive human preference studies and automated evaluations show that \ourmethod{} outperforms existing T2I AR models and is competitive with state-of-the-art T2I diffusion models while being up to $7{\times}$ faster.
9: \end{abstract}
10: