\begin{abstract}
	Despite the great success of Transformer networks in various applications such as natural language processing and computer vision, their theoretical aspects are not well understood.
	In this paper, we study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite-dimensional inputs.
	Although the inputs and outputs are both infinite-dimensional, we show that when the target function has anisotropic smoothness,
	Transformers can avoid the curse of dimensionality due to their feature extraction ability and parameter sharing property.
	In addition, we show that even if the smoothness changes depending on each input,
	Transformers can estimate the importance of features for each input and extract important features dynamically.
	We then prove that Transformers achieve a convergence rate similar to that in the fixed-smoothness case.
	Our theoretical results support the practical success of Transformers for high-dimensional data.
\end{abstract}