
1: \begin{abstract}

2: Deep learning models, in particular \textit{image} models, have recently become considerably more general and robust.

3: In this work, we propose to exploit such advances in the realm of \textit{video} classification.

4: Video foundation models, however, require extensive pretraining and long training times.

5: To mitigate these limitations, we propose ``\textit{Attention Map (AM) Flow}'' for image models, a method for identifying the pixels relevant to motion in each input video frame. We propose two methods for computing AM flow, depending on camera motion.

6: AM flow allows the separation of spatial and temporal processing, while providing improved results over combined spatio-temporal processing (as in video models).

7: Adapters, a popular technique in parameter-efficient transfer learning, facilitate the incorporation of AM flow into pretrained image models, mitigating the need for full fine-tuning.

8: We extend adapters to ``\textit{temporal processing adapters}'' by incorporating a temporal processing unit into the adapters.

9: Our method converges faster, thereby reducing the number of epochs needed for training.

10: Moreover, we enable an image model to achieve state-of-the-art results on popular action recognition datasets, which reduces training time and simplifies pretraining.

11: We present experiments on the Kinetics-400, Something-Something v2, and Toyota Smarthome datasets, showing state-of-the-art or comparable results.

12: Our code will be made available on GitHub.

13: \end{abstract}

14: