\begin{abstract}
State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision, giving rise to the folklore that 16-bit hardware compute units alone are not enough to maximize model accuracy.
As a result, deep learning accelerators are forced to support both 16-bit and 32-bit floating-point units (FPUs), which makes hardware design more costly than supporting 16-bit FPUs alone.
We ask: \emph{can we train deep learning models using only 16-bit floating-point units, while still matching the model accuracy attained by 32-bit training}?
Towards this end, we study \emph{16-bit-FPU training} on the widely adopted \BFHS unit.
While these units conventionally use nearest rounding to cast outputs to 16-bit precision, we show that nearest rounding of model weight updates often cancels small updates, which degrades convergence and model accuracy.
Motivated by this, we study two simple techniques that are well established in numerical analysis, stochastic rounding and Kahan summation, to remedy the model accuracy degradation in 16-bit-FPU training.
We demonstrate that these two techniques can yield up to a $7\%$ absolute validation accuracy gain in 16-bit-FPU training. The resulting validation accuracy ranges from $0.1\%$ lower to $0.2\%$ higher than that of 32-bit training across seven deep learning applications.
\end{abstract}
