\begin{abstract}
Matrix factorization (MF) discovers latent features from observations and has shown great promise in collaborative filtering, data compression, feature extraction, word embedding, \emph{etc.} While many problem-specific optimization techniques have been proposed, alternating least squares (ALS) remains popular due to its general applicability (\emph{e.g.}, it easily handles positive-unlabeled inputs),
fast convergence, and parallelization capability. Current MF implementations are either optimized for a single machine or require a large computer cluster, yet both remain insufficient. This is because a single machine provides limited compute power for large-scale data, while multiple machines suffer from the network communication bottleneck.

To address this challenge, accelerating ALS on graphics processing units (GPUs) is a promising direction. We propose a novel approach that enhances MF efficiency through both \textbf{memory optimization} and \textbf{approximate computing}. The former exploits the GPU memory hierarchy to increase data reuse, while the latter reduces unnecessary computation without hurting the convergence of the learning algorithm. Extensive experiments on large-scale datasets show that our solution not only outperforms the competing CPU solutions by a large margin
but also achieves a \textbf{2x-4x} performance gain over
the state-of-the-art GPU solutions.
Our implementations are open-sourced and publicly available.
\end{abstract}