\begin{abstract}
Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them difficult to reproduce. In this work, we introduce \textbf{T}ext-\textbf{A}ware \textbf{T}ransformer-based 1-D\textbf{i}mensional \textbf{Tok}enizer (\modelname), an efficient and powerful image tokenizer that can use either discrete or continuous 1-dimensional tokens. \modelname uniquely integrates textual information during the tokenizer decoding stage (\ie, de-tokenization), accelerating convergence and improving performance.
\modelname also benefits from a simple yet effective one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image \textbf{Mask}ed \textbf{Gen}erative Models (\genmodelname), trained exclusively on open data while achieving performance comparable to that of models trained on private data.
We aim to release both the efficient, strong \modelname tokenizers and the open-data, open-weight \genmodelname models to promote broader access and democratize the field of text-to-image masked generative models.
\end{abstract}