1: \begin{abstract}
2: To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments.
3: As a price for robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case.
4: Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory.
5: In this paper, we design a first fully model-free DRRL algorithm, called \emph{distributionally robust Q-learning with single trajectory (DRQ)}.
6: We delicately design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without modeling the environment, thus the algorithm can be trained along a single trajectory in a model-free fashion.
7: Despite the algorithm's complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools.
8: Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm, compared to non-robust methods and other robust RL algorithms.
9: \end{abstract}
10: