1: \begin{abstract}
2: Reinforcement Learning agents have so far leveraged domain and human knowledge predominantly through reward shaping. However, directly shaping the policy of the agent rather than its reward offers a potentially wider application scope: expert designers can easily subject a robot to a safe backup policy; regular users can teach their robot assistant by advising it to take certain actions; knowledge can be transferred from one agent to another; mutually complementary algorithms can be combined; etc. Until now, shaping the agent's policy with an external advisory policy has been restricted to value-based methods, such as Q-learning and SARSA. Our method, Directed Policy Gradient (DPG), extends Policy Gradient-based algorithms to allow an external advisory policy to directly influence the agent's action selection. This way, the agent is biased or forced to explore the areas of interest, without endangering Policy Gradient's convergence. We illustrate the large application potential of our contribution with three experiments: 1) a safety-critical task in which the agent obeys and learns from a designer-provided backup policy; 2) a navigation task in which a policy previously learned on a similar problem is reused as the advisory policy; 3) a novel actor-critic formulation in which the Softmax policy arising from the critic advises the actor, and which we show outperforms conventional actor-critic algorithms.
3:
4:
5: % Leveraging external advisory policies through Policy Shaping has a large application potential:
6:
7:
8: \end{abstract}
9: