\begin{abstract}
Matrix factorization (MF) is widely used in recommender systems, database systems, topic modeling, word embedding, and other applications. Stochastic gradient descent (SGD) is a popular method for solving MF problems because it handles large data sets and lends itself to incremental learning.
We observe that SGD for MF is memory bound. Single-node CPU systems with caches perform well only on small data sets, while distributed systems offer higher aggregate memory bandwidth but suffer from relatively slow network connections. This observation motivates us to accelerate MF by exploiting the high memory bandwidth and fast intra-node connections of GPUs.

We present \textbf{cuMF\_SGD}, a CUDA-based SGD solution for large-scale MF problems. On a single GPU, we design two workload scheduling schemes, batch-Hogwild! and wavefront-update, that fully exploit the massive number of cores. In particular, batch-Hogwild!, a vectorized version of Hogwild!, overcomes the issue of memory discontinuity. We develop highly optimized kernels for the SGD update that leverage caches, warp-shuffle instructions, half-precision floats, and other techniques. We also design a partition scheme that utilizes multiple GPUs while addressing the well-known convergence issue that arises when parallelizing SGD. Evaluations on three data sets with only one Maxwell or Pascal GPU show that cuMF\_SGD runs \textbf{3.1X-28.2X} as fast as state-of-the-art CPU solutions on 1-64 CPU nodes. Evaluations also show that cuMF\_SGD scales well across multiple GPUs on large data sets. Finally, we believe that the lessons learned from building cuMF\_SGD apply to other machine learning algorithms, for example those on (1) embedding layers in deep learning and (2) bipartite graphs.
\end{abstract}