\begin{abstract}
Exploration in reinforcement learning (RL) remains an open challenge.
RL algorithms rely on observing rewards to train the agent, and if informative rewards are sparse, the agent learns slowly or may not learn at all.
To improve exploration and reward discovery, popular algorithms rely on
% adding random noise to the agent's actions, intrinsic rewards, or
optimism.
% But what if sometimes rewards are \textit{unobservable}, i.e., there are situations when the agent is unable to observe the given reward, even though the reward exists?
% Prior work on partial monitoring has shown that these methods fail in bandits with partial information.
% Yet, the problem of reward unobservability is still new to MDPs.
% With this paper, we want to fill this gap.
But what if rewards are sometimes \textit{unobservable}, as in partial monitoring in bandits and the recent formalism of monitored Markov decision processes?
In this case, optimism can lead to suboptimal behavior in which the agent stops exploring before its uncertainty is resolved.
In this paper, we present a novel exploration strategy that overcomes the limitations of existing methods and guarantees convergence to an optimal policy even when rewards are not always observable.
We further propose a collection of tabular environments for benchmarking exploration in RL (with and without unobservable rewards) and show that our method outperforms existing ones.
\end{abstract}