87eda4d2c7be9f0e.tex
1: \begin{abstract}
2:     To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments.
3:     As the price of this robustness, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case.
4:     Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory.
5:     In this paper, we design the first fully model-free DRRL algorithm, called \emph{distributionally robust Q-learning with single trajectory (DRQ)}.
6:     We carefully design a multi-timescale framework that fully utilizes each incrementally arriving sample and directly learns the optimal distributionally robust policy without modeling the environment, so that the algorithm can be trained along a single trajectory in a model-free fashion.
7:     Despite the algorithm's complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools. 
8:     Comprehensive experimental results demonstrate the superior robustness and sample efficiency of our proposed algorithm compared to non-robust methods and other robust RL algorithms.
9: \end{abstract}
10: