\begin{abstract}
\vspace{-2mm}
This work presents \ourmethod, a scale-wise transformer for text-to-image (T2I) generation.
Starting from existing next-scale prediction autoregressive (AR) models, we first explore them for T2I generation and propose architectural modifications that improve their convergence and overall performance.
We then argue that scale-wise transformers do not require causality and propose a non-causal counterpart that enables ${\sim}11\%$ faster sampling and lower memory usage while also achieving slightly better generation quality.
Furthermore, we find that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance.
By disabling guidance at these scales, we achieve an additional sampling acceleration of ${\sim}20\%$ and improve the generation of fine-grained details.
Extensive human preference studies and automated evaluations show that \ourmethod{} outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to $7{\times}$ faster.
\end{abstract}
