\begin{abstract}
	Despite the great success of Transformer networks in various applications such as natural language processing and computer vision, their theoretical aspects are not well understood.
	In this paper, we study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite-dimensional inputs.
	Although inputs and outputs are both infinite-dimensional, we show that when the target function has anisotropic smoothness, Transformers can avoid the curse of dimensionality due to their feature extraction ability and parameter sharing property.
	In addition, we show that even if the smoothness changes depending on each input, Transformers can estimate the importance of features for each input and extract important features dynamically.
	We then prove that Transformers achieve a convergence rate similar to that in the fixed-smoothness case.
	Our theoretical results support the practical success of Transformers for high-dimensional data.
\end{abstract}