\begin{abstract}
Many multi-armed bandit algorithms assume that the reward of each arm is constant across rounds, but this assumption fails in many real-world scenarios.
This paper considers the setting of recovering bandits \citep{pike-burkeRecoveringBandits2019a}, where the reward of an arm depends on the number of rounds elapsed since that arm was last pulled.
We propose a new reinforcement learning (RL) algorithm tailored to this setting, named the State-Separate SARSA (SS-SARSA) algorithm, which treats these elapsed rounds as states.
The SS-SARSA algorithm learns efficiently by reducing the number of state combinations required by Q-learning/SARSA, which often suffer from combinatorial issues in large-scale RL problems.
Additionally, it makes minimal assumptions about the reward structure and offers lower computational complexity.
Furthermore, we prove asymptotic convergence to an optimal policy under mild assumptions.
Simulation studies demonstrate the superior performance of our algorithm across various settings.
\end{abstract}