\begin{abstract}
Deep learning models, in particular \textit{image} models, have recently become markedly more general and robust.
In this work, we propose to exploit these advances for \textit{video} classification.
Video foundation models require extensive pretraining and long training times.
To mitigate these limitations, we propose ``\textit{Attention Map (AM) Flow}'' for image models, a method for identifying the pixels relevant to motion in each input video frame. We present two methods for computing AM flow, chosen according to the camera motion.
AM flow allows spatial and temporal processing to be separated, while improving on the results of combined spatio-temporal processing (as in video models).
Adapters, a popular technique in parameter-efficient transfer learning, facilitate the incorporation of AM flow into pretrained image models, mitigating the need for full fine-tuning.
We extend adapters to ``\textit{temporal processing adapters}'' by incorporating a temporal processing unit into them.
Our method converges faster, reducing the number of epochs needed for training.
Moreover, it enables an image model to achieve state-of-the-art results on popular action recognition datasets, which reduces training time and simplifies pretraining.
We present experiments on the Kinetics-400, Something-Something v2, and Toyota Smarthome datasets, showcasing state-of-the-art or comparable results.
Our code will be made available on GitHub.
\end{abstract}