2577210d70cf6674.tex
1: \begin{abstract}
2: \vspace{-2.5mm}
3: Transformer encoder architectures have recently achieved state-of-the-art results on monocular 3D human mesh reconstruction, but they require a substantial number of parameters and expensive computations.
4: Due to the large memory overhead and slow inference speed, it is difficult to deploy such models for practical use.
5: In this paper, we propose a novel transformer encoder-decoder architecture for
6: 3D human mesh reconstruction from a single image, called \mbox{\textit{FastMETRO}}. 
7: We identify that the performance bottleneck in \mbox{encoder-based} transformers 
8: is caused by the token design, which introduces high-complexity interactions among input tokens.
9: We disentangle the interactions via an \mbox{encoder-decoder} architecture,
10: which allows our model to require far fewer parameters and a shorter inference time.
11: In addition, we impose prior knowledge of the human body's morphological relationships 
12: via attention masking and mesh upsampling operations,
13: which leads to faster convergence with higher accuracy.
14: Our \mbox{FastMETRO} improves the \mbox{Pareto front} of accuracy and efficiency, and clearly outperforms \mbox{image-based} methods on Human3.6M and 3DPW.
15: Furthermore, we validate its generalizability on FreiHAND.
16: 
17: \vspace{-1mm}
18: \keywords{3D human mesh recovery, transformer, encoder-decoder}
19: \vspace{-2mm}
20: \end{abstract}
21: