
\begin{abstract}

We study a novel setting in Online Markov Decision Processes (OMDPs) where the loss function is chosen by a \emph{non-oblivious} strategic adversary who follows a no-external-regret algorithm. In this setting, we first demonstrate that MDP-Expert, an existing algorithm that works well with oblivious adversaries, still applies and achieves a policy regret bound of $\mathcal{O}(\sqrt{T \log(L)}+\tau^2\sqrt{ T \log(|A|)})$, where $L$ is the size of the adversary's pure strategy set and $|A|$ denotes the size of the agent's action space. Considering real-world games where the support size of a Nash equilibrium (NE) is small, we further propose a new algorithm, \emph{MDP-Online Oracle Expert} (MDP-OOE), that achieves a policy regret bound of $\mathcal{O}(\sqrt{T\log(L)}+\tau^2\sqrt{ T k \log(k)})$, where $k$ depends only on the support size of the NE. MDP-OOE leverages the key benefit of Double Oracle in game theory and can thus solve games with prohibitively large action spaces. Finally, to better understand the learning dynamics of no-regret methods, under the same setting of a no-external-regret adversary in OMDPs, we introduce an algorithm that achieves last-round convergence to a NE. To the best of our knowledge, this is the first work to establish a last-round convergence result in OMDPs.\footnote{Accepted at Autonomous Agents and Multi-Agent Systems (2023): \url{https://doi.org/10.1007/s10458-023-09599-5}.}

\end{abstract}
