\begin{abstract}
%With the increasing concerns over data privacy, on-device training is becoming an important machine learning paradigm that allows either a single device or many devices to establish a model without giving away their training data. To make on-device training practical, a key obstacle is its prohibitively costly resource usage, especially the long training time and energy consumption.

This paper proposes \sys, the first system that enables highly resource-efficient on-device training by orchestrating mixed-precision training with on-chip Digital Signal Processor (DSP) offloading.
\sys fully exploits the DSP's advantages in integer-based numerical computation through four novel techniques:
(1) a CPU-DSP co-scheduling scheme to mitigate the overhead from DSP-unfriendly operators;
(2) a self-adaptive rescaling algorithm to reduce the overhead of dynamic rescaling in backward propagation;
(3) a batch-splitting algorithm to improve the DSP cache efficiency;
(4) a DSP-compute subgraph reuse mechanism to eliminate the preparation overhead on the DSP.
We have fully implemented \sys and demonstrated its effectiveness through extensive experiments.
The results show that, compared to the state-of-the-art DNN engines \texttt{TFLite} and \texttt{MNN}, \sys reduces the per-batch training time by 5.5$\times$ and the energy consumption by 8.9$\times$ on average.
In end-to-end training tasks, \sys reduces convergence time and energy consumption by up to 10.7$\times$ and 13.1$\times$, respectively, with only 1.9\%--2.7\% accuracy loss compared to the FP32 precision setting.
\end{abstract}
