\begin{abstract}
% With the development of multimedia applications, multimodal recommendation with rich item context (\eg, images and text) plays an important role in modern recommender systems.
% However, current methods mainly treat multimodal information as an auxiliary signal and overlook the semantic gaps between modalities and the resulting alignment problem, which leads to insufficient representations of users and items.
% In this paper, we take an alignment view of multimodal recommendation and propose a novel framework named \textbf{our-model-name}.
% We separate the multimodal recommendation task into three alignment stages: inter-modality alignment, modality-ID alignment, and user-item alignment. These stages align and fuse different ID and modality features and finally generate user and item representations with fine-grained context and behavior knowledge. We also decouple the training process into two sub-processes to avoid convergence and optimization difficulties. Based on this decoupled training strategy, we design intermediate evaluation protocols to validate the influence of aligned multimodal representations on recommendation. Comprehensive experiments validate the effectiveness, efficiency, and generalizability of our framework.
%
% \end{abstract}