\begin{proof}
Since the second stopping criterion is triggered whenever the number of visits to a state-action pair is doubled, the start times of macro episodes can be expressed as
\begin{align*}
\{t_1\}\cup \Big(\bigcup_{(s,a)\in\mathcal S \times \mathcal A} \{t_k: k \in\mathcal M_{(s,a)} \}\Big)
\end{align*}
where
\begin{align*}
\mathcal M_{(s,a)} =
\{k \leq K_T:  N_{t_k}(s,a) > 2N_{t_{k-1}}(s,a)\}.
\end{align*}
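
To make the definition concrete, consider a hypothetical pair $(s,a)$ whose visit counts at the start of six consecutive episodes are as follows (the numbers are illustrative only and are not taken from the analysis, and we assume $K_T = 6$):
\begin{align*}
\big(N_{t_1}(s,a),\dots,N_{t_6}(s,a)\big) = (0,\,1,\,1,\,3,\,4,\,9).
\end{align*}
Comparing each count with twice its predecessor gives $1>0$, $1\not>2$, $3>2$, $4\not>6$, and $9>8$, so under these assumed counts $\mathcal M_{(s,a)}=\{2,4,6\}$.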

Since the number of visits to $(s,a)$ more than doubles at every $t_k$ with $k\in\mathcal M_{(s,a)}$, the size of $\mathcal M_{(s,a)}$ cannot exceed $O(\log T)$. This argument is made rigorous as follows.

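If $|\mathcal M_{(s,a)}| \geq \log(N_{T+1}(s,a))+1$, then, with $\log$ taken to base $2$, we have
\begin{align*}
N_{t_{K_T}}(s,a) = \prod_{k \leq K_T, N_{t_{k-1}}(s,a)\geq 1}\frac{N_{t_{k}}(s,a)}{N_{t_{k-1}}(s,a)}
> \prod_{k \in \mathcal M_{(s,a)}, N_{t_{k-1}}(s,a)\geq 1} \hspace{-2em} 2 \hspace{2em}
\geq N_{T+1}(s,a).
\end{align*}
The strict inequality holds because every factor with $k \in \mathcal M_{(s,a)}$ exceeds $2$ and every other factor is at least $1$. The last inequality holds because at most one $k \in \mathcal M_{(s,a)}$ has $N_{t_{k-1}}(s,a) = 0$, so the product contains at least $|\mathcal M_{(s,a)}| - 1 \geq \log(N_{T+1}(s,a))$ factors of $2$.
But this contradicts the fact that $N_{t_{K_T}}(s,a) \leq N_{T+1}(s,a)$. Therefore,
$|\mathcal M_{(s,a)}| \leq \log(N_{T+1}(s,a))$ for all $(s,a)$.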
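
As a sanity check on a hypothetical instance (assuming $\log$ is taken to base $2$ and $N_{T+1}(s,a)=8$, values chosen only for illustration), note that visit counts are integers, so each count at the start of an episode in $\mathcal M_{(s,a)}$ is at least twice the previous such count plus one. Writing $N$ for the successive counts,
\begin{align*}
N \geq 1, \quad N \geq 2\cdot 1+1 = 3, \quad N \geq 2\cdot 3+1 = 7, \quad N \geq 2\cdot 7+1 = 15 > 8.
\end{align*}
A fourth doubling is therefore impossible, so in this instance $|\mathcal M_{(s,a)}|\leq 3=\log_2 8$.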

From this property we obtain a bound on the number of macro episodes as
\begin{align}
M &\leq 1+\sum_{(s,a)}|\mathcal M_{(s,a)}|
\leq 1 + \sum_{(s,a)}\log(N_{T+1}(s,a))
\notag\\
&\leq 1 + SA \log\Big(\sum_{(s,a)}N_{T+1}(s,a)/SA\Big)
= 1 + SA \log(T/SA) \leq SA \log(T)
\label{eq:Mbound}
\end{align}
where the first inequality is the union bound, the third inequality holds because $\log$ is concave, the equality uses $\sum_{(s,a)}N_{T+1}(s,a) = T$, and the last inequality holds whenever $SA\log(SA) \geq 1$.
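
The concavity step can be spelled out via Jensen's inequality. The following sketch uses the shorthand $x_{(s,a)} = N_{T+1}(s,a)$, introduced here only for readability, and assumes every $x_{(s,a)}\geq 1$ so that each logarithm is defined:
\begin{align*}
\sum_{(s,a)}\log x_{(s,a)}
= SA\cdot\frac{1}{SA}\sum_{(s,a)}\log x_{(s,a)}
\leq SA\,\log\Big(\frac{1}{SA}\sum_{(s,a)}x_{(s,a)}\Big)
= SA\,\log(T/SA).
\end{align*}
For a hypothetical instance with $SA=4$, $T=64$, and base-$2$ logarithms, the chain gives $M \leq 1+4\log_2 16 = 17 \leq 4\log_2 64 = 24$.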

\end{proof}