
\begin{abstract}
Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval.
Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies.
Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency.
We introduce \modelname{}, a Mamba-based video hashing model with an improved self-supervised learning paradigm.
Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which are effective and efficient in capturing temporal relationships thanks to the data-dependent selective scanning mechanism with linear complexity.
In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss as a global learning signal.
Our self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence.
Extensive experiments demonstrate \modelname{}'s improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency.
\end{abstract}
