\begin{abstract}

Model-free Reinforcement Learning (RL) generally suffers from poor sample complexity, mostly due to the need to exhaustively explore the state-action space to find well-performing policies. On the other hand, we postulate that expert knowledge of the system often allows us to design simple rules we expect good policies to follow at all times. In this work, we hence propose a simple yet effective modification of continuous actor-critic frameworks to incorporate such rules and avoid regions of the state-action space that are known to be suboptimal, thereby significantly accelerating the convergence of RL agents. Concretely, we saturate the actions chosen by the agent if they do not comply with our intuition and, critically, modify the gradient update step of the policy to ensure the learning process is not affected by the saturation step. On a room temperature control case study, this modification allows agents to converge to well-performing policies up to $6$--$7\times$ faster than classical agents, without computational overhead and while retaining good final performance.

\end{abstract}
