abstract:2577210d70cf6674.tex

1: \begin{abstract}

2: \vspace{-2.5mm}

3: Transformer encoder architectures have recently achieved state-of-the-art results on monocular 3D human mesh reconstruction, but they require a substantial number of parameters and expensive computations.

4: Due to the large memory overhead and slow inference speed, it is difficult to deploy such models for practical use.

5: In this paper, we propose a novel transformer encoder-decoder architecture for

6: 3D human mesh reconstruction from a single image, called \mbox{\textit{FastMETRO}}.

7: We identify that the performance bottleneck in \mbox{encoder-based} transformers

8: is caused by the token design, which introduces high-complexity interactions among input tokens.

9: We disentangle the interactions via an \mbox{encoder-decoder} architecture,

10: which substantially reduces the number of parameters and the inference time.

11: In addition, we impose prior knowledge of the human body's morphological relationships

12: via attention masking and mesh upsampling operations,

13: which leads to faster convergence with higher accuracy.

14: Our \mbox{FastMETRO} improves the Pareto front of accuracy and efficiency, and clearly outperforms \mbox{image-based} methods on Human3.6M and 3DPW.

15: Furthermore, we validate its generalizability on FreiHAND.

16:

17: \vspace{-1mm}

18: \keywords{3D human mesh recovery, transformer, encoder-decoder}

19: \vspace{-2mm}

20: \end{abstract}

21: