\begin{abstract} \emph{Truncated Backpropagation Through Time} (truncated
BPTT, \cite{jaeger2002tutorial}) is a widespread method for
learning recurrent computational graphs.
Truncated BPTT keeps the computational benefits of
\emph{Backpropagation Through Time} (BPTT, \cite{werbos:bptt}) while
removing the need for a complete backtrack through the whole data
sequence at every step. However, truncation favors short-term
dependencies: the gradient estimate of truncated
BPTT is biased, so it does not benefit from the convergence
guarantees of stochastic gradient theory. We introduce \emph{Anticipated Reweighted
Truncated Backpropagation} (ARTBP), an algorithm that keeps the
computational benefits of truncated BPTT while providing
unbiasedness. ARTBP works by using variable truncation lengths
together with carefully chosen compensation factors in the
backpropagation equation. We check the viability of ARTBP on two
tasks. The first is
a simple synthetic task where careful balancing of temporal dependencies at different scales is needed: truncated BPTT displays unreliable performance
and, in worst-case scenarios, divergence, while ARTBP converges
reliably.
Second, on Penn Treebank character-level language modelling \cite{ptb_proc},
ARTBP slightly outperforms truncated BPTT.
\end{abstract}