\begin{proof}
Since the second stopping criterion is triggered whenever the number of visits to a state-action pair is doubled, the start times of macro episodes can be expressed as
\begin{align*}
\{t_1\}\cup \Big(\bigcup_{(s,a)\in\mathcal S \times \mathcal A} \{t_k: k \in\mathcal M_{(s,a)} \}\Big)
\end{align*}
where
\begin{align*}
\mathcal M_{(s,a)} =
\{k \leq K_T:  N_{t_k}(s,a) > 2N_{t_{k-1}}(s,a)\}.
\end{align*}
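Since the number of visits to $(s,a)$ more than doubles at every $t_k$ such that $k\in\mathcal M_{(s,a)}$, the size of $\mathcal M_{(s,a)}$ should be $O(\log T)$. This argument is made rigorous as follows.

Suppose that $|\mathcal M_{(s,a)}| \geq \log(N_{T+1}(s,a))+1$, with $\log$ taken to base $2$. Because $N_{t_k}(s,a)$ is nondecreasing in $k$, at most one $k\in\mathcal M_{(s,a)}$ has $N_{t_{k-1}}(s,a)=0$, so at least $|\mathcal M_{(s,a)}|-1$ indices in $\mathcal M_{(s,a)}$ satisfy $N_{t_{k-1}}(s,a)\geq 1$. Then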
\begin{align*}
N_{t_{K_T}}(s,a) = \prod_{k \leq K_T, N_{t_{k-1}}(s,a)\geq 1}\frac{N_{t_{k}}(s,a)}{N_{t_{k-1}}(s,a)}
> \prod_{k \in \mathcal M_{(s,a)}, N_{t_{k-1}}(s,a)\geq 1} \hspace{-2em} 2 \hspace{2em}
\geq N_{T+1}(s,a).
\end{align*}
But this contradicts the fact that $N_{t_{K_T}}(s,a) \leq N_{T+1}(s,a)$, which holds because $t_{K_T} \leq T$ and the visit count is nondecreasing in time. Therefore,
$|\mathcal M_{(s,a)}| \leq \log(N_{T+1}(s,a))$ for all $(s,a)$.
This property yields the following bound on the number of macro episodes:
\begin{align}
M &\leq 1+\sum_{(s,a)}|\mathcal M_{(s,a)}|
\leq 1 + \sum_{(s,a)}\log(N_{T+1}(s,a))
\notag\\
&\leq 1 + SA \log\Big(\sum_{(s,a)}N_{T+1}(s,a)/SA\Big)
= 1 + SA \log(T/SA) \leq SA \log(T)
\label{eq:Mbound}
\end{align}
where the first inequality is the union bound, the third inequality follows from Jensen's inequality because $\log$ is concave, and the equality uses $\sum_{(s,a)}N_{T+1}(s,a) = T$.
\end{proof}
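In more detail, the concavity step in \eqref{eq:Mbound} can be sketched as follows, under the assumption (not stated above) that $N_{T+1}(s,a)\geq 1$ for every $(s,a)$, so that each logarithm is finite. Applying Jensen's inequality to the uniform average over the $SA$ state-action pairs, and using that the visit counts sum to $T$, gives
\begin{align*}
\frac{1}{SA}\sum_{(s,a)}\log(N_{T+1}(s,a))
\leq \log\Big(\frac{1}{SA}\sum_{(s,a)}N_{T+1}(s,a)\Big)
= \log(T/SA),
\end{align*}
and multiplying both sides by $SA$ recovers the third inequality. The final inequality in \eqref{eq:Mbound} is then equivalent to $SA\log(SA)\geq 1$, which holds, for instance, when $SA\geq 2$ and $\log$ is taken to base $2$.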