\begin{abstract}
Exploration in reinforcement learning (RL) remains an open challenge.
RL algorithms rely on observing rewards to train the agent, and if informative rewards are sparse, the agent learns slowly or may not learn at all.
To improve exploration and reward discovery, popular algorithms rely on
% adding random noise to the agent's actions, intrinsic rewards, or
optimism.
% But what if sometimes rewards are \textit{unobservable}, i.e., there are situations when the agent is unable to observe the given reward, yet the reward exists?
% Prior work on partial monitoring has shown that these methods fail in bandits with partial information.
% Yet, the problem of reward unobservability is still new to MDPs.
% With this paper, we want to fill this gap.
But what if rewards are sometimes \textit{unobservable}, as in partial monitoring for bandits and the recent formalism of monitored Markov decision processes?
In this case, optimism can lead to suboptimal behavior in which the agent does not explore further to resolve its uncertainty.
In this paper, we present a novel exploration strategy that overcomes the limitations of existing methods and guarantees convergence to an optimal policy even when rewards are not always observable.
We further propose a collection of tabular environments for benchmarking exploration in RL (with and without unobservable rewards) and show that our method outperforms existing ones.
\end{abstract}