\begin{abstract}
In this paper, we consider reinforcement learning of Markov Decision Processes (MDPs) with peak constraints, where an agent chooses a policy to optimize an objective while satisfying additional peak constraints. The agent has to take actions based on the observed states, reward outputs, and constraint outputs, without any knowledge of the dynamics, the reward functions, or the constraint functions. We introduce a transformation of the original problem that makes it possible to apply reinforcement learning algorithms in which the agent maximizes a bounded and unconstrained objective. We show that the policies obtained from the transformed problem are optimal whenever the original problem is feasible. Our solution is memory efficient and does not require storing the values of the constraint functions. To the best of our knowledge, this is the first time that learning algorithms guarantee convergence to optimal stationary policies for the MDP problem with peak constraints, for both discounted and expected average rewards.
\end{abstract}