
\begin{abstract}

% As humans, we encode the world recurrently and pay attention for precise information recall.

Efficiently modeling sequences with infinite context length has been a long-standing problem. Prior work suffers from either quadratic computational complexity or limited ability to extrapolate to sequences longer than those seen in training. In this work, we present \textsc{Samba}, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). \textsc{Samba} selectively compresses a given sequence into recurrent hidden states while retaining the ability to precisely recall memories through the attention mechanism.

We scale \textsc{Samba} up to 3.8B parameters with 3.2T training tokens and show that it substantially outperforms state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on sequences of length 4K, \textsc{Samba} can be efficiently extrapolated to a 256K context length with perfect memory recall and shows improved token predictions up to a 1M context length. As a linear-time sequence model, \textsc{Samba} achieves $3.73\times$ higher throughput than Transformers with grouped-query attention when processing user prompts of 128K length, and a $3.64\times$ speedup when generating 64K tokens with unlimited streaming. A sample implementation of \textsc{Samba} is publicly available at \url{https://github.com/microsoft/Samba}.


\end{abstract}
