\begin{abstract}
Attentional mechanisms are order-invariant. Positional encoding is therefore crucial for allowing attention-based deep model architectures such as the Transformer to handle sequences or images in which the position of information matters. In this paper, we propose a novel positional encoding method based on learnable Fourier features. Instead of hard-coding each position as a token or a vector, we represent each position, which can be multi-dimensional, as a trainable encoding based on a learnable Fourier feature mapping, modulated with a multi-layer perceptron. This representation is particularly advantageous for spatial multi-dimensional positions, e.g., pixel positions in an image, where $L_2$ distances or more complex positional relationships need to be captured. Experiments on several public benchmark tasks show that our learnable Fourier feature representation for multi-dimensional positional encoding outperforms existing methods, both improving accuracy and converging faster.

%{\color{blue}(Cho: some potential ablation studies: a) Compare sin/cosine position encoding + MLP vs. MLP only vs. learnable Fourier features + MLP vs. random Fourier features + MLP; this can demonstrate whether learnable Fourier features are important) b) The results with/without MLP, show MLP is also important. }

\end{abstract}
