
1: %!TEX root =  RBSC_short.tex

2:

3: \section{A ``Performance'' Bound} \label{bound}

4:

5: In this section, we present a result that offers some insight into why one can expect the one-step lookahead policy to perform well when the linear programming relaxation of the original problem is sufficiently tight. We begin with the following lemma.

6:

7: %Hence, incorporating meaningful constraints reducing the size of the relaxation polytope, such as (\ref{st1}), (\ref{st2}), not only improves the quality of the performance bound, but is also expected to improve the performance of the actual sub-optimal policy obtained. NOT TRUE, BOUND BELOW DOES NOT HAVE RELAXATION QUALITY ON RHS BECAUSE WEIGHT IS SAF INSTEAD OF INITIAL DISTRIBUTION.

8:

9:

10:

11: \begin{lem} \label{lem: super-harmonic}

12: The approximate reward-to-go (\ref{approximate value}) is a feasible solution for the original dual linear program (\ref{eq: dual exact RBSC}).

13: \end{lem}

14:

15: \begin{proof}

16: Consider one constraint in the original dual LP (\ref{eq: dual exact RBSC}), for a fixed state-action tuple $(\mathbf{x},\mathbf{s},\mathbf{a})$. We first treat the case where $s_i \neq a_i$ for all $i \in \{1,\ldots, N\}$. Summing the constraints (\ref{eq: constraint xs,s,a}) over $i$ for the given values of $x_{s_i},s_i,a_i$, we get

17: \begin{align*}

18: & \sum_{i=1}^N \lambda^{i}_{s_i, x_{s_i}} + \sum_{i=1}^N \mu^{i}_{s_i, a_i} + \sum_{i=1}^N \kappa_{s_i, x_{s_i}} - \sum_{i=1}^N \sum_{i'=1}^N \zeta^{i'}_{a_i} + \sum_{i=1}^N \sum_{a'=1}^N \zeta^{i}_{a'} \\

19: & = \sum_{i=1}^N \lambda^{i}_{s_i, x_{s_i}} + \sum_{i=1}^N \mu^{i}_{s_i, a_i} + \sum_{i=1}^N \kappa_{s_i, x_{s_i}} \geq 0.

20: \end{align*}

21: The cancellation follows from the discussion preceding Theorem~\ref{thm: equivalence}. Next, summing the constraints (\ref{eq: constraint xa,s,a}) over $i$, we also get

22: \begin{align*}

23: & - \alpha \sum_{i=1}^N \sum_{\tilde x_{a_i}} p^{{1}\{i \leq M\}}_{x_{a_i} \tilde x_{a_i}} \lambda^{i}_{a_i, \tilde x_{a_i}} - \sum_{i=1}^N \mu^{i}_{s_i, a_i} - \sum_{i=1}^N \kappa_{a_i, x_{a_i}} \geq \sum_{i=1}^N \left( r^{{1}\{i \leq M\}}_{a_i} (x_{a_i}) - c_{s_i a_i} {{1}\{i \leq M\}} \right).

24: \end{align*}

25: Finally, we add these two inequalities. The $\mu$ and $\kappa$ terms cancel, and we obtain

26: \begin{align*}

27: & \sum_{i=1}^N \lambda^{i}_{s_i, x_{s_i}}  - \alpha \sum_{i=1}^N \sum_{\tilde x_{a_i}} p^{{1}\{i \leq M\}}_{x_{a_i} \tilde x_{a_i}} \lambda^{i}_{a_i, \tilde x_{a_i}} \geq \sum_{i=1}^N \left( r^{{1}\{i \leq M\}}_{a_i} (x_{a_i}) - c_{s_i a_i} {{1}\{i \leq M\}} \right),

28: \end{align*}

29: which is the inequality obtained by using the vector (\ref{approximate value}) in the constraints of (\ref{eq: dual exact RBSC}).

30:

31: The case where $s_i=a_i$ for some $i$ is handled almost identically, using the constraints (\ref{eq: constraint xs,s,s}) for the corresponding indices.

32: \end{proof}
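As an illustration of the cancellation used in the first display of the proof, suppose (this is an assumption made here for the example) that the action vector $\mathbf{a}=(a_1,\ldots,a_N)$ is a permutation of $\{1,\ldots,N\}$. Re-indexing the first double sum by $a'=a_i$ then gives
\begin{align*}
\sum_{i=1}^N \sum_{i'=1}^N \zeta^{i'}_{a_i} = \sum_{i'=1}^N \sum_{a'=1}^N \zeta^{i'}_{a'} = \sum_{i=1}^N \sum_{a'=1}^N \zeta^{i}_{a'},
\end{align*}
so the two $\zeta$ terms cancel. For instance, with $N=2$ and $\mathbf{a}=(2,1)$, both double sums equal $\zeta^{1}_{1}+\zeta^{1}_{2}+\zeta^{2}_{1}+\zeta^{2}_{2}$.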

33:

34:

35: In the following theorem, the occupation measure $F_\alpha(\nu,\tilde{u})$ is a vector of size $|\mathcal{S}|$ representing the discounted infinite-horizon frequencies of the states under policy $\tilde u$ and initial distribution $\nu$ \cite{derman-LP}. The proof of the theorem follows from the analysis presented in \cite{deFarias-ADP}; see \cite{leny06_RBSC_CDC} for more details.

36:

37: %, which we define from the state-action frequency simply as

38: %\[

39: %F_\alpha(\nu,\tilde{u};\mathbf{x},\mathbf{s}) = \sum_{\mathbf{a} \in \Pi_{[N]}} %f_\alpha(\nu,\tilde{u};\mathbf{x},\mathbf{s},\mathbf{a}), \; \forall (\mathbf{x},\mathbf{s}).

40: %\]

41:

42: \begin{thm}

43: Let $\nu$ be an initial distribution on the states, of the product form (\ref{eq: initial distribution - product}). Let $J^*$ be the optimal reward function, let $\tilde J$ be an approximation of this reward function that is feasible for the LP (\ref{eq: dual exact RBSC}), and let $\tilde{u}$ be the associated one-step lookahead policy. Let $F_\alpha(\nu,\tilde{u})$ and $J_{\tilde{u}}$ be, respectively, the occupation measure vector and the expected reward associated with the policy $\tilde{u}$. Then

44:

45:

46: \begin{align} \label{upperBound}

47: \nu^T(J^*-J_{\tilde{u}}) \leq \frac{1}{1-\alpha} F_\alpha(\nu,\tilde{u})^T (\tilde{J}-J^*).

48: \end{align}

49: %or in other words,

50: %\[

51: %\sum_{\mathbf{x},\mathbf{s}} \nu(\mathbf{x},\mathbf{s}) |J^*(\mathbf{x},\mathbf{s}) - J_{\tilde{u}}(\mathbf{x},\mathbf{s})| \leq

52: %\frac{1}{1-\alpha} \sum_{\mathbf{x},\mathbf{s}} F_\alpha(\nu,\tilde{u};\mathbf{x},\mathbf{s}) (\tilde J(\mathbf{x},\mathbf{s}) - J^* (\mathbf{x},\mathbf{s})).

53: %\]

54: \end{thm}
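The following is only a sketch of a standard argument leading to (\ref{upperBound}), and not the proof given in the cited references. The notation is introduced here for illustration and should be read as an assumption: for a stationary policy $u$, let $P_u$ and $r_u$ denote its transition matrix and its one-step reward vector (net of switching costs), let $T_u J = r_u + \alpha P_u J$, and let $TJ = \max_u T_u J$ componentwise. We further assume that $\tilde u$ is greedy with respect to $\tilde J$, i.e., $T_{\tilde u} \tilde J = T \tilde J$, and that the occupation measure is normalized as $F_\alpha(\nu,\tilde u)^T = (1-\alpha)\, \nu^T (I - \alpha P_{\tilde u})^{-1}$. Feasibility of $\tilde J$ means $\tilde J \geq T \tilde J$, which by monotonicity of $T$ gives $\tilde J \geq J^*$ and hence $T \tilde J \geq T J^* = J^*$. Since $J_{\tilde u} = (I - \alpha P_{\tilde u})^{-1} r_{\tilde u}$ and $(I - \alpha P_{\tilde u})^{-1} = \sum_{k \geq 0} \alpha^k P_{\tilde u}^k$ has nonnegative entries, we get
\begin{align*}
\nu^T (J^* - J_{\tilde u}) & \leq \nu^T (\tilde J - J_{\tilde u}) = \nu^T (I - \alpha P_{\tilde u})^{-1} (\tilde J - T_{\tilde u} \tilde J) \\
& = \nu^T (I - \alpha P_{\tilde u})^{-1} (\tilde J - T \tilde J) \leq \nu^T (I - \alpha P_{\tilde u})^{-1} (\tilde J - J^*) = \frac{1}{1-\alpha} F_\alpha(\nu,\tilde u)^T (\tilde J - J^*).
\end{align*}
If $F_\alpha$ is instead taken unnormalized, the same chain yields the inequality without the factor $1/(1-\alpha)$, so the stated bound still holds because $\tilde J - J^* \geq 0$ and $1/(1-\alpha) \geq 1$.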

55:

56: By Lemma~\ref{lem: super-harmonic}, the theorem holds in particular for $\tilde J$ formed according to (\ref{approximate value}). In words, starting from a distribution $\nu$ over the states, the difference in expected rewards between the optimal policy and the one-step lookahead policy is bounded by a weighted $\ell^1$-distance between the estimate $\tilde{J}$ used in the design of the policy and the optimal value function $J^*$, with weights given by the occupation measure of the one-step lookahead policy. This provides some motivation for obtaining a good approximation $\tilde J$, i.e., a tight relaxation, which was an important element of this paper.
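To make the weighted-distance reading concrete, the bound (\ref{upperBound}) can be written out componentwise. Here we assume, for illustration only, that the entries of the occupation measure are indexed by the joint state $(\mathbf{x},\mathbf{s})$ and written $F_\alpha(\nu,\tilde{u};\mathbf{x},\mathbf{s})$:
\begin{align*}
\sum_{\mathbf{x},\mathbf{s}} \nu(\mathbf{x},\mathbf{s}) \left( J^*(\mathbf{x},\mathbf{s}) - J_{\tilde{u}}(\mathbf{x},\mathbf{s}) \right) \leq \frac{1}{1-\alpha} \sum_{\mathbf{x},\mathbf{s}} F_\alpha(\nu,\tilde{u};\mathbf{x},\mathbf{s}) \left( \tilde J(\mathbf{x},\mathbf{s}) - J^*(\mathbf{x},\mathbf{s}) \right).
\end{align*}
Every summand on both sides is nonnegative, because $J_{\tilde u} \leq J^* \leq \tilde J$ componentwise (the second inequality holds for any $\tilde J$ that is feasible for the LP). This is why both sides can be read as weighted $\ell^1$-norms, with weights $\nu$ on the left and $F_\alpha(\nu,\tilde u)$ on the right.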

57:

58: %This result is true for every one-step lookahead policy that uses a superharmonic vector as an approximation of the cost-to-go. It gives some motivation to obtain a good approximation $\tilde J$ and a tight relaxation. %It is not a strong performance bound however, since the quantity that we would really like to minimize, $f_\alpha(\nu,\tilde{u})^T (\tilde{J}-J^*)$, depends itself on the approximate policy $\tilde u$ to be designed,

59:

60:

61:

62:

63:

64: