\begin{abstract}
This work presents a modification of the self-attention dynamics proposed by~\citet{mathpersp23} to better reflect the practically relevant, causally masked attention used in transformer architectures for generative AI. This modification translates into an interacting particle system that cannot be interpreted as a mean-field gradient flow. Despite this loss of structure, we significantly strengthen the results of~\citet{mathpersp23} in this context: while previous rigorous results focused on cases where all three matrices (Key, Query, and Value) were scaled identities, we prove asymptotic convergence to a single cluster for arbitrary key-query matrices and a value matrix equal to the identity.
Additionally, we establish a connection to the classical R\'enyi parking problem from combinatorial geometry, taking initial theoretical steps towards demonstrating the existence of meta-stable states.
\end{abstract}