e991bf7d9a6582d1.tex
1: \begin{abstract}
2:  Recent advances of gradient temporal-difference
3:  methods allow to learn off-policy multiple value functions in
4:  parallel without sacrificing convergence guarantees or computational efficiency. This opens
5:  up new possibilities for sound ensemble techniques in reinforcement learning.
6:   In this work we propose learning an ensemble of policies related through
7:   potential-based shaping rewards. The ensemble induces a combination
8:   policy by using a voting mechanism on its components. Learning
9:   happens in real time, and we empirically show the combination policy to
10:   outperform the individual policies of the ensemble.
11: \end{abstract}
12: