
\begin{abstract}

Matrix factorization (MF) discovers latent features from observations and has shown great promise in collaborative filtering, data compression, feature extraction, word embedding, \emph{etc.} While many problem-specific optimization techniques have been proposed, alternating least squares (ALS) remains popular due to its general applicability (\emph{e.g.}, it easily handles positive-unlabeled inputs), fast convergence, and parallelization capability. Current MF implementations are either optimized for a single machine or require a large computer cluster, and both remain insufficient: a single machine provides limited compute power for large-scale data, while multiple machines suffer from the network communication bottleneck.


Accelerating ALS on graphics processing units (GPUs) is a promising direction for addressing this challenge. We propose a novel approach that improves MF efficiency via both \textbf{memory optimization} and \textbf{approximate computing}. The former exploits the GPU memory hierarchy to increase data reuse, while the latter reduces unnecessary computation without hurting the convergence of the learning algorithm. Extensive experiments on large-scale datasets show that our solution not only outperforms competing CPU solutions by a large margin but also achieves a \textbf{2x--4x} performance gain over state-of-the-art GPU solutions.

Our implementations are open source and publicly available.

\end{abstract}
