\begin{abstract}
Text-driven 3D stylization is a complex and crucial task in computer vision (CV) and computer graphics (CG) that aims to transform a bare mesh so that it matches a target text. Prior methods adopt text-independent multilayer perceptrons (MLPs) to predict the attributes of the target mesh under the supervision of a CLIP loss. However, such a text-independent architecture lacks textual guidance during attribute prediction, which leads to \emph{unsatisfactory stylization} and \emph{slow convergence}. To address these limitations, we present \shortname{}, a text-driven 3D stylization framework that incorporates a novel \modulename{} (\shortmodulename{}). The \shortmodulename{} dynamically integrates the guidance of the target text by applying text-relevant spatial and channel-wise attention during vertex feature extraction, resulting in more accurate attribute prediction and faster convergence.
Furthermore, existing works lack standard benchmarks and automated metrics for evaluation, and often rely on subjective, non-reproducible user studies to assess the quality of stylized 3D assets. To overcome this limitation, we introduce a new standard text-mesh benchmark, namely \datasetname{}, and two automated metrics, which will enable future research to achieve fair and objective comparisons. Our extensive qualitative and quantitative experiments demonstrate that \shortname{} outperforms previous state-of-the-art methods. Our code and results are available at our project webpage: \url{https://xmu-xiaoma666.github.io/Projects/X-Mesh/}.
\end{abstract}