1: %!TEX root = RBSC_short.tex
2:
3: \section{A ``Performance'' Bound} \label{bound}
4:
5: In this section, we present a result that offers some insight into why the one-step lookahead policy can be expected to perform well when the linear programming relaxation of the original problem is sufficiently tight. We begin with the following lemma.
6:
7: %Hence, incorporating meaningful constraints reducing the size of the relaxation polytope, such as (\ref{st1}), (\ref{st2}), not only improves the quality of the performance bound, but is also expected to improve the performance of the actual sub-optimal policy obtained. NOT TRUE, BOUND BELOW DOES NOT HAVE RELAXATION QUALITY ON RHS BECAUSE WEIGHT IS SAF INSTEAD OF INITIAL DISTRIBUTION.
8:
9:
10:
11: \begin{lem} \label{lem: super-harmonic}
12: The approximate reward-to-go (\ref{approximate value}) is a feasible solution for the original dual linear program (\ref{eq: dual exact RBSC}).
13: \end{lem}
14:
15: \begin{proof}
16: Consider one constraint in the original dual LP (\ref{eq: dual exact RBSC}), for a fixed state-action tuple $(\mathbf{x},\mathbf{s},\mathbf{a})$. We first treat the case where $s_i \neq a_i$ for all $i \in \{1,\ldots, N\}$. Summing the constraints (\ref{eq: constraint xs,s,a}) over $i$ for the given values of $x_{s_i},s_i,a_i$, we get
17: \begin{align*}
18: & \sum_{i=1}^N \lambda^{i}_{s_i, x_{s_i}} + \sum_{i=1}^N \mu^{i}_{s_i, a_i} + \sum_{i=1}^N \kappa_{s_i, x_{s_i}} - \sum_{i=1}^N \sum_{i'=1}^N \zeta^{i'}_{a_i} + \sum_{i=1}^N \sum_{a'=1}^N \zeta^{i}_{a'} \\
19: & = \sum_{i=1}^N \lambda^{i}_{s_i, x_{s_i}} + \sum_{i=1}^N \mu^{i}_{s_i, a_i} + \sum_{i=1}^N \kappa_{s_i, x_{s_i}} \geq 0.
20: \end{align*}
21: The cancellation of the $\zeta$ terms follows from the discussion preceding Theorem~\ref{thm: equivalence}. Summing the constraints (\ref{eq: constraint xa,s,a}) over $i$ in the same way, we also get
22: \begin{align*}
23: & - \alpha \sum_{i=1}^N \sum_{\tilde x_{a_i}} p^{{1}\{i \leq M\}}_{x_{a_i} \tilde x_{a_i}} \lambda^{i}_{a_i, \tilde x_{a_i}} - \sum_{i=1}^N \mu^{i}_{s_i, a_i} - \sum_{i=1}^N \kappa_{a_i, x_{a_i}} \geq \sum_{i=1}^N \left( r^{{1}\{i \leq M\}}_{a_i} (x_{a_i}) - c_{s_i a_i} {{1}\{i \leq M\}} \right).
24: \end{align*}
25: Finally, we add these two inequalities. The $\mu$ and $\kappa$ terms cancel, and we obtain
26: \begin{align*}
27: & \sum_{i=1}^N \lambda^{i}_{s_i, x_{s_i}} - \alpha \sum_{i=1}^N \sum_{\tilde x_{a_i}} p^{{1}\{i \leq M\}}_{x_{a_i} \tilde x_{a_i}} \lambda^{i}_{a_i, \tilde x_{a_i}} \geq \sum_{i=1}^N \left( r^{{1}\{i \leq M\}}_{a_i} (x_{a_i}) - c_{s_i a_i} {{1}\{i \leq M\}} \right),
28: \end{align*}
29: which is the inequality obtained by using the vector (\ref{approximate value}) in the constraints of (\ref{eq: dual exact RBSC}).
30:
31: The case where $s_i=a_i$ for some $i$ is handled in almost the same way, using the constraints (\ref{eq: constraint xs,s,s}) for the corresponding indices.
32: \end{proof}
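
The two cancellations used in the proof can be spelled out as follows. This is only a sketch, under the assumption (not restated in this section) that both $\mathbf{s}$ and $\mathbf{a}$ are permutations of $\{1,\ldots,N\}$, so that $i \mapsto s_i$ and $i \mapsto a_i$ are bijections. Relabeling $a' = a_i$ in the first double sum, and then renaming $i'$ as $i$ and exchanging the order of summation, gives
\begin{align*}
\sum_{i=1}^N \sum_{i'=1}^N \zeta^{i'}_{a_i} = \sum_{a'=1}^N \sum_{i'=1}^N \zeta^{i'}_{a'} = \sum_{i=1}^N \sum_{a'=1}^N \zeta^{i}_{a'},
\end{align*}
so the $\zeta$ terms drop out of the first inequality. Under the same assumption, with $j$ a dummy index over the sites,
\begin{align*}
\sum_{i=1}^N \kappa_{s_i, x_{s_i}} = \sum_{j=1}^N \kappa_{j, x_j} = \sum_{i=1}^N \kappa_{a_i, x_{a_i}},
\end{align*}
so the $\kappa$ terms cancel once the two inequalities are added, while the $\mu$ terms cancel term by term.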
33:
34:
35: In the following theorem, the occupation measure $F_\alpha(\nu,\tilde{u})$ is a vector of size $|\mathcal{S}|$, representing the discounted infinite-horizon frequencies of the states under policy $\tilde u$ and initial distribution $\nu$ \cite{derman-LP}. The proof of the theorem follows from the analysis presented in \cite{deFarias-ADP}; see \cite{leny06_RBSC_CDC} for more details.
36:
37: %, which we define from the state-action frequency simply as
38: %\[
39: %F_\alpha(\nu,\tilde{u};\mathbf{x},\mathbf{s}) = \sum_{\mathbf{a} \in \Pi_{[N]}} %f_\alpha(\nu,\tilde{u};\mathbf{x},\mathbf{s},\mathbf{a}), \; \forall (\mathbf{x},\mathbf{s}).
40: %\]
41:
42: \begin{thm}
43: Let $\nu$ be an initial distribution on the states, of the product form (\ref{eq: initial distribution - product}). Let $J^*$ be the optimal reward function, $\tilde J$ be an approximation of this reward function which is feasible for the LP (\ref{eq: dual exact RBSC}), and $\tilde{u}$ be the associated one-step lookahead policy. Let $F_\alpha(\nu,\tilde{u})$ and $J_{\tilde{u}}$ be the occupation measure vector and the expected reward associated with the policy $\tilde{u}$. Then
44:
45:
46: \begin{align} \label{upperBound}
47: \nu^T(J^*-J_{\tilde{u}}) \leq \frac{1}{1-\alpha} F_\alpha(\nu,\tilde{u})^T (\tilde{J}-J^*).
48: \end{align}
49: %or in other words,
50: %\[
51: %\sum_{\mathbf{x},\mathbf{s}} \nu(\mathbf{x},\mathbf{s}) |J^*(\mathbf{x},\mathbf{s}) - J_{\tilde{u}}(\mathbf{x},\mathbf{s})| \leq
52: %\frac{1}{1-\alpha} \sum_{\mathbf{x},\mathbf{s}} F_\alpha(\nu,\tilde{u};\mathbf{x},\mathbf{s}) (\tilde J(\mathbf{x},\mathbf{s}) - J^* (\mathbf{x},\mathbf{s})).
53: %\]
54: \end{thm}
55:
56: By Lemma~\ref{lem: super-harmonic}, the theorem applies in particular to $\tilde J$ formed according to (\ref{approximate value}). In words, it says that starting with a distribution $\nu$ over the states, the difference in expected rewards between the optimal policy and the one-step lookahead policy is bounded by a weighted $\ell^1$-distance between the estimate $\tilde{J}$ used in the design of the policy and the optimal value function $J^*$. The weights are given by the occupation measure of the one-step lookahead policy. The bound thus provides some motivation to obtain a good approximation $\tilde J$, i.e., a tight relaxation, which was an important element of this paper.
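
The argument behind (\ref{upperBound}) can be sketched as follows. This is only an outline in the spirit of \cite{deFarias-ADP}, not a reproduction of the proof in \cite{leny06_RBSC_CDC}, and the notation below is introduced here for illustration. Let $T$ denote the dynamic programming operator of the problem, and let $r_{\tilde u}$, $P_{\tilde u}$ and $T_{\tilde u} J = r_{\tilde u} + \alpha P_{\tilde u} J$ denote the one-step reward vector, the transition matrix and the operator associated with $\tilde u$. We assume that feasibility of $\tilde J$ for (\ref{eq: dual exact RBSC}) means $\tilde J \geq T \tilde J$ componentwise, and that the occupation measure is normalized to sum to one, i.e., $F_\alpha(\nu,\tilde u)^T = (1-\alpha)\, \nu^T (I - \alpha P_{\tilde u})^{-1}$. Since $\tilde u$ is greedy with respect to $\tilde J$, we have $T_{\tilde u} \tilde J = T \tilde J$, and since $J_{\tilde u} = (I - \alpha P_{\tilde u})^{-1} r_{\tilde u}$,
\begin{align*}
\tilde J - J_{\tilde u} = (I - \alpha P_{\tilde u})^{-1} \left( \tilde J - \alpha P_{\tilde u} \tilde J - r_{\tilde u} \right) = (I - \alpha P_{\tilde u})^{-1} \left( \tilde J - T \tilde J \right).
\end{align*}
Monotonicity of $T$ together with $\tilde J \geq T \tilde J$ gives $\tilde J \geq J^*$, hence $T \tilde J \geq T J^* = J^*$ and $0 \leq \tilde J - T \tilde J \leq \tilde J - J^*$. Because $(I - \alpha P_{\tilde u})^{-1} = \sum_{t \geq 0} \alpha^t P_{\tilde u}^t$ has nonnegative entries,
\begin{align*}
\nu^T (J^* - J_{\tilde u}) \leq \nu^T (\tilde J - J_{\tilde u}) \leq \nu^T (I - \alpha P_{\tilde u})^{-1} (\tilde J - J^*) = \frac{1}{1-\alpha} F_\alpha(\nu,\tilde u)^T (\tilde J - J^*).
\end{align*}
If the occupation measure is instead left unnormalized, the same steps give the inequality without the factor $1/(1-\alpha)$, which still implies (\ref{upperBound}) because $1/(1-\alpha) \geq 1$ and the right-hand side is nonnegative.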
57:
58: %This result is true for every one-step lookahead policy that uses a superharmonic vector as an approximation of the cost-to-go. It gives some motivation to obtain a good approximation $\tilde J$ and a tight relaxation. %It is not a strong performance bound however, since the quantity that we would really like to minimize, $f_\alpha(\nu,\tilde{u})^T (\tilde{J}-J^*)$, depends itself on the approximate policy $\tilde u$ to be designed,
59:
60:
61:
62:
63:
64: