\begin{abstract} \emph{Truncated Backpropagation Through Time} (truncated
    BPTT, \cite{jaeger2002tutorial}) is a widespread method for
    learning recurrent computational graphs.
    Truncated BPTT keeps the computational benefits of
    \emph{Backpropagation Through Time} (BPTT, \cite{werbos:bptt}) while
    removing the need for a complete backtrack through the whole data
    sequence at every step.  However, truncation favors short-term
    dependencies: the gradient estimate of truncated
    BPTT is biased, so it does not benefit from the convergence
    guarantees of stochastic gradient theory. We introduce \emph{Anticipated Reweighted
    Truncated Backpropagation} (ARTBP), an algorithm that keeps the
    computational benefits of truncated BPTT while providing
    an unbiased gradient estimate. ARTBP works by using variable truncation lengths
    together with carefully chosen compensation factors in the
    backpropagation equation. We check the viability of ARTBP on two
    tasks. First, on
    a simple synthetic task where careful balancing of temporal
    dependencies at different scales is needed, truncated BPTT displays
    unreliable performance and, in worst-case scenarios, divergence,
    while ARTBP converges reliably.
    Second, on Penn Treebank character-level language modelling \cite{ptb_proc},
    ARTBP slightly outperforms truncated BPTT.
\end{abstract}