
\begin{abstract}

State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision, creating the folklore that 16-bit hardware compute units alone are not enough to maximize model accuracy.

As a result, deep learning accelerators are forced to support both 16-bit and 32-bit floating-point units (FPUs), which makes the hardware design more costly than one that uses only 16-bit FPUs.

We ask: \emph{can we train deep learning models using only 16-bit floating-point units, while still matching the model accuracy attained by 32-bit training?}

To this end, we study \emph{16-bit-FPU training} on the widely adopted \BFHS unit.

While these units conventionally use nearest rounding to cast their output to 16-bit precision, we show that nearest rounding of model weight updates often cancels small updates, which degrades both convergence and model accuracy.

Motivated by this, we study two simple techniques that are well established in numerical analysis, stochastic rounding and Kahan summation, to remedy the model accuracy degradation in 16-bit-FPU training.

We demonstrate that these two techniques can yield an absolute validation accuracy gain of up to $7\%$ in 16-bit-FPU training. As a result, the validation accuracy ranges from $0.1\%$ lower to $0.2\%$ higher than that of 32-bit training across seven deep learning applications.

\end{abstract}
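
% Illustrative sketch only: the symbols w_t, u_t, c_t, y_t, a_l, a_u, SR and fl
% are assumptions introduced here and are not notation taken from the paper.
As a brief illustration of the effect and the two techniques named in the abstract, the following sketch states them in their textbook forms. The notation is introduced here purely for illustration, and the example assumes a bfloat16-style format with 8 bits of significand precision, so that representable values in $[1,2)$ are spaced $2^{-7}$ apart. Let $\mathrm{fl}(\cdot)$ denote nearest rounding to 16-bit precision, $w_t$ a model weight, and $u_t$ its update at step $t$. Under these assumptions, for $w_t = 1$ and any $0 < u_t < 2^{-8}$,
\[
  \mathrm{fl}(w_t + u_t) = w_t ,
\]
so the update is cancelled entirely. Stochastic rounding instead rounds a real value $a$ lying between the neighboring representable values $a_\ell < a_u$ as
\[
  \mathrm{SR}(a) =
  \left\{
  \begin{array}{ll}
    a_u    & \mbox{with probability } (a - a_\ell)/(a_u - a_\ell), \\
    a_\ell & \mbox{otherwise,}
  \end{array}
  \right.
\]
which is unbiased, $\mathrm{E}[\mathrm{SR}(a)] = a$, so small updates survive in expectation. Kahan summation keeps nearest rounding but carries an auxiliary 16-bit compensation term $c_t$, with $c_0 = 0$, that records the rounding error of each step and feeds it back into the next one:
\[
  y_t = \mathrm{fl}(u_t - c_t), \qquad
  w_{t+1} = \mathrm{fl}(w_t + y_t), \qquad
  c_{t+1} = \mathrm{fl}\bigl(\mathrm{fl}(w_{t+1} - w_t) - y_t\bigr).
\]
This is the standard form of the algorithm; the variant used in a particular training setup may differ in its details.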
