1: % Extended version for Tech report
2: %
3:
4: \documentclass[a4paper]{article}
5: \usepackage[dvips]{graphicx}
6: \usepackage{techrep}
7: \usepackage{mlapa}
8:
9: \renewcommand{\vec}[1]{{\bf#1}}
10: \newcommand{\new}{{\mbox{\tiny new}}}
11: \newcommand{\dd}{\mbox{d}}
12:
13:
14: \author{Ivo Kwee \hspace{8mm} Marcus Hutter \hspace{8mm}
15: J\"{u}rgen Schmidhuber \\
16: % Istituto Dalle Molle sull'Intelligenza Artificiale \\
17: {\normalsize IDSIA, Manno CH-6928, Switzerland.} \\
18: {\normalsize \tt \{ivo,marcus,juergen\}@idsia.ch} }
19: \title{Gradient-based Reinforcement Planning in Policy-Search Methods
20: \footnote{ This is an extended version of the paper presented at the
21: EWRL 2001 in Utrecht (The Netherlands). In this technical report,
22: the derivation steps are presented with more detail, more
23: footnotes, appendices and more (unfinished) ideas. } }
24: \date{November 2001}
25:
26: \reportnumber{14-01}
27: \reportaddnote{This work was supported by SNF grants 21-55409.98 and
28: 2000-61847.00 }
29:
30: \begin{document}
31: \makecover
32: \maketitle
33:
34: %\twocolumn[%\vspace{-3ex}
35: %Proceedings of the ICML-2001
36: %\ewrltitle{Gradient-based Reinforcement Planning in Policy-Search Methods}
37: %\ewrlauthor{Ivo Kwee}{ivo@idsia.ch}
38: %\ewrlauthor{Marcus Hutter}{marcus@idsia.ch}
39: %\ewrlauthor{Juergen Schmidhuber}{juergen@idsia.ch}
40: %\ewrladdress{Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA),
41: % Galleria 2, CH-6928 Manno, Switzerland}
42: %\vskip 0.25in
43: %]
44:
45: \begin{abstract}
46: We introduce a learning method called ``gradient-based reinforcement
47: planning'' (GREP). Unlike traditional DP methods that improve their
48: policy backwards in time, GREP is a gradient-based method that plans
49: ahead and improves its policy \emph{before} it actually acts in the
50: environment. We derive formulas for the exact policy gradient that
51: maximizes the expected future reward and confirm our ideas
52: with numerical experiments.
53: \end{abstract}
54:
55:
56: %========================================================================
57: \section{Introduction}
58: %========================================================================
59:
60: It has been shown that \emph{planning} can dramatically improve
61: convergence in reinforcement learning
(RL)~\cite{Schmidhuber:90sandiego,suttonbarto}. However, most RL
methods proposed so far that explicitly use planning are value
(or $Q$-value) based methods, such as \emph{Dyna-Q} or
\emph{prioritized sweeping}.
66:
Recently, much attention has been directed to so-called
\emph{policy-gradient} methods that improve their policy directly by
calculating the derivative of the future expected reward with respect
to the policy parameters. Gradient-based methods are believed to be
more advantageous than value-function based methods in huge state
spaces and in POMDP settings. Probably the first gradient-based RL
formulation is the class of REINFORCE algorithms of
Williams~\cite{williams92simple}. More recent methods include,
e.g.,~\cite{baird98gradient,baxter99direct,sutton00policy}. Our approach of
deriving the gradient has the flavor of~\cite{ng99policy}, who derive
the gradient using future state probabilities.
79:
80: Our novel contribution in this paper is to combine gradient-based
81: learning with explicit planning. We introduce ``gradient-based
82: reinforcement planning'' (GREP) that improves a policy \emph{before}
83: actually interacting with the environment. We derive the formulas for
84: the exact policy gradient and confirm our ideas in numerical
experiments. GREP learns the action probabilities of a probabilistic
policy for discrete problems. While we will illustrate GREP with a
small MDP maze, it may be used for the hidden state in POMDPs.
88:
89: %A key feature of our method is that we use so-called
90: %\emph{adjoint Monte Carlo sampling} which enables us to compute the
91: %gradient of the reward function with respect to policy variables in an
92: %efficient way.
93:
94: %========================================================================
95: \section{Derivation of the policy gradient}
96: %========================================================================
97:
98: %----------------------------------------------------------------------
\subsubsection*{Projection matrix}
100: %----------------------------------------------------------------------
Let us denote the discrete space of \emph{states} by $\mathcal{S} =
\{1,...,N\}$. Our belief on $\mathcal{S}$ is described by a probability
vector $\vec{s}$ of which each element $s_i$ represents the
probability of being in state $i$. We also define a set of actions
$\mathcal{A} = \{1,...,K\}$. The stochastic policy $\mathcal{P}$ is
represented by a matrix $\vec{P}:K \times N$ with elements $P_{ki} =
p(a_k|s_i)$, i.e. the conditional probability of action $k$ in
state $i$.%
\footnote{For computational reasons, one often reparameterizes the
policy using a Boltzmann distribution. In this paper the
probability $p(a_k|s_i)$ is simply given by $P_{ki}$; we do not use
a reparameterization, in order to keep the analysis clear.} %
Furthermore, let the environment $\mathcal{E}$ be defined by transition
matrices $\vec{T}_k$ ($k=1,...,K$) with elements $T_{kji} =
p(s_j|s_i,a_k)$, i.e. the probability of a transition from $s_i$ to
$s_j$ given action $k$.
117:
118: Now we define the \emph{projection matrix} $\vec{F}$ with elements
119: \begin{equation}
120: F_{ji} = \sum_k T_{kji} \, P_{ki}.
121: \label{eq:projection}
122: \end{equation}
It is important to note that matrix $\vec{F}$ does \emph{not} model the
transition probabilities of the environment itself, but the induced
transition probabilities under policy $\mathcal{P}$ in environment
$\mathcal{E}$. The induced transition probability $F_{ji}$ is a
weighted sum over actions $k$ of the transition probabilities
$T_{kji}$ with the policy parameters $P_{ki}$ as the weights.
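
As a small illustration of Eq.~\ref{eq:projection}, consider a
hypothetical problem with $N=2$ states and $K=2$ actions. The numbers
below are chosen only for illustration and are not taken from the
experiments. We read $P_{ki}$ with the action index $k$ first, as in
Eq.~\ref{eq:projection}, so that every column of $\vec{P}$ and of each
$\vec{T}_k$ sums to one. Suppose
\[
\vec{T}_1 = \left( \begin{array}{cc} 1 & 0.5 \\ 0 & 0.5 \end{array} \right), \quad
\vec{T}_2 = \left( \begin{array}{cc} 0 & 0 \\ 1 & 1 \end{array} \right), \quad
\vec{P} = \left( \begin{array}{cc} 0.9 & 0.3 \\ 0.1 & 0.7 \end{array} \right).
\]
Evaluating $F_{ji} = \sum_k T_{kji} P_{ki}$ element by element gives
\[
\vec{F} = \left( \begin{array}{cc}
1 \cdot 0.9 + 0 \cdot 0.1 & 0.5 \cdot 0.3 + 0 \cdot 0.7 \\
0 \cdot 0.9 + 1 \cdot 0.1 & 0.5 \cdot 0.3 + 1 \cdot 0.7
\end{array} \right)
= \left( \begin{array}{cc} 0.9 & 0.15 \\ 0.1 & 0.85 \end{array} \right).
\]
The columns of $\vec{F}$ again sum to one, as they should for an
induced transition matrix.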
129:
130:
131: %----------------------------------------------------------------------
132: \subsubsection*{Expected state occupancy}
133: %----------------------------------------------------------------------
134: Using the projection matrix $\vec{F}$, states $\vec{s}_{t}$ and
135: $\vec{s}_{t+1}$ are related as $\vec{s}_{t+1} = \vec{F} \vec{s}_{t}$
136: and therefore $\vec{s}_{t} = \vec{F}^t \vec{s}_0$, where $\vec{s}_0$ is
137: the state probability distribution at $t=0$. We can now define the
138: \emph{expected state occupancy} as
139: \begin{equation}
140: \vec{z} = E[\vec{s}|\vec{s}_0]
141: = \sum_{t=0}^\infty \gamma^t \vec{s}_t
142: = \sum_{t=0}^\infty (\gamma \vec{F})^t \vec{s}_0
143: = (\vec{I} - \gamma \vec{F})^{-1} \vec{s}_0
144: \label{eq:occupancy}
145: \end{equation}
where $\gamma$ is a discount factor ($0 \le \gamma < 1$) that keeps the sum finite. In the
last step, we have recognized the sum as the Neumann series of
the inverse. Notice that $\vec{z}$ is the solution of the linear
equation
150: \begin{equation}
151: (\vec{I} - \gamma \vec{F}) \vec{z} = \vec{s}_0
152: \label{eq:system}
153: \end{equation}
154: which is just the familiar Bellman equation for the expected occupancy
155: probability $\vec{z}$.
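
A quick sanity check on Eq.~\ref{eq:occupancy}, sketched here under the
assumption that every column of $\vec{F}$ sums to one: writing $\vec{1}$
for the all-ones vector (a symbol introduced only for this remark), we
have $\vec{1}^T \vec{F} = \vec{1}^T$ and therefore
\[
\vec{1}^T \vec{z} = \sum_{t=0}^\infty \gamma^t \, \vec{1}^T \vec{s}_t
= \frac{1}{1-\gamma},
\]
so $\vec{z}$ is not itself a probability vector, but $(1-\gamma)\vec{z}$
is. As a hypothetical numerical example, take $\gamma = 0.5$,
$\vec{s}_0 = (1, 0)^T$ and
\[
\vec{F} = \left( \begin{array}{cc} 0.9 & 0.15 \\ 0.1 & 0.85 \end{array} \right),
\qquad
\vec{I} - \gamma \vec{F} =
\left( \begin{array}{cc} 0.55 & -0.075 \\ -0.05 & 0.575 \end{array} \right).
\]
The determinant is $0.55 \cdot 0.575 - 0.075 \cdot 0.05 = 0.3125$, and
solving Eq.~\ref{eq:system} gives
\[
\vec{z} = \frac{1}{0.3125}
\left( \begin{array}{c} 0.575 \\ 0.05 \end{array} \right)
= \left( \begin{array}{c} 1.84 \\ 0.16 \end{array} \right),
\]
whose elements sum to $2 = 1/(1-\gamma)$, in agreement with the
identity above.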
156:
157: %%[IVO: huh??!!]
158: %Let us also define
159: %\begin{equation}
160: % \mathcal{K}(\vec{F}) \,|\, : \vec{s}_0 \mapsto \vec{z}
161: %\end{equation}
162: %as the Bellman operator that maps and
163: %\begin{equation}
164: % \mathcal{K}^*: \vec{z} \mapsto (\vec{F},\vec{s}_0)
165: %\end{equation}
166:
167: %----------------------------------------------------------------------
168: \subsubsection*{Expected reward function}
169: %----------------------------------------------------------------------
170: In \emph{reinforcement learning} (RL) the objective is to maximize
171: future reward. We define a reward vector $\vec{r}$ in the same domain
172: as $\vec{s}$. Using the expected occupancy $\vec{z}$ the future
173: expected reward $H$ is simply
174: \begin{equation}
175: H = \langle\vec{r}, \vec{z} \rangle
176: \label{eq:rl-error}
177: \end{equation}
where $\langle \cdot,\cdot \rangle$ is the scalar (inner) product.%
179: %
180: \footnote{ In \emph{optimal control} (OC) we want to reach some target
181: state under some optimality conditions (mostly minimum time or
182: minimum energy). We denote $\vec{r}$ as our target distribution and
denote the time-to-arrival as $t^*$. If $t^*$ were known beforehand
then we would minimize the error
185: \begin{equation}
186: H = \frac{1}{2} ( \vec{r} - \vec{s}_{t^*} )^2
187: = \frac{1}{2} ( \vec{r} - \vec{F}^{t^*} \vec{s}_0 )^2
188: \end{equation}
189:
However, the exact arrival time is mostly not known beforehand; the
most we can do is to minimize the (time-weighted) expected error
192: \begin{equation}
193: H = \sum_{t^*=0}^\infty \gamma^{t^*}
194: \frac{1}{2} ( \vec{r} - \vec{F}^{t^*} \vec{s}_0 )^2
195: = \frac{1}{2} k \vec{r}^2
196: - \vec{r} \vec{z} + \frac{1}{2} \sum_{t^*=0}^\infty \gamma^{t^*}
197: \left( \vec{F}^{t^*} \vec{s}_0 \right)^2
198: \label{eq:control-error}
199: \end{equation}
The first term on the right hand side, with
$k = \sum_{t^*=0}^\infty \gamma^{t^*} = 1/(1-\gamma)$, is often not
relevant because it is independent of $\vec{F}$. When we compare
Eq.~\ref{eq:control-error} with Eq.~\ref{eq:rl-error}, we see that the
OC error function has a quadratic term in $\vec{F}$ which the RL error
lacks.
205: }
206:
207: Because $\vec{z}$ is a solution of Eq.~\ref{eq:system} it is dependent
208: on $\vec{F}$ which in turn depends on policy $\vec{P}$. Given
209: $\vec{r}$ and $\vec{s}_0$, our task is to find the optimal $\vec{P}^*$
210: such that $H$ is maximized, i.e.
211: \begin{equation}
212: \vec{P}^* = \arg \max_{\vec{P}} H .
213: \end{equation}
214:
215: We can regard the calculation of the future expected reward as a
216: composition of two operators
217: \begin{equation}
218: \mathcal{Q}: \vec{F} \mapsto \vec{z}
219: \end{equation}
220: which maps the transition matrix $\vec{F}$ to the expected occupancy
221: probabilities $\vec{z}$, and
\begin{equation}
\mathcal{R}: \vec{z} \mapsto H
\end{equation}
which maps the probabilities $\vec{z}$, given a reward distribution
$\vec{r}$, to the expected reward value $H$.
227:
228: %----------------------------------------------------------------------
229: \subsubsection*{Calculation of the policy gradient}
230: %----------------------------------------------------------------------
231: A variation $\delta \vec{z}$ in the expected occupancy can be related
232: to first order to a perturbation $\delta \vec{P}$ in the (stochastic)
233: policy. To obtain the partial derivatives $\partial \vec{z} / \partial
234: P_{ik}$, we differentiate Eq.~\ref{eq:system} with respect to $P_{ik}$
235: and obtain:
236: \begin{equation}
237: - \gamma \frac{\partial \vec{F}}{\partial P_{ik}} \vec{z}
238: + (\vec{I} - \gamma \vec{F})\frac{\partial \vec{z}}{\partial P_{ik}}
239: = 0.
240: \label{eq:diff1}
241: \end{equation}
242: The right hand side of the equation is zero because we assume that
243: $\vec{s}_0$ is independent of $P_{ik}$.
244: Rearranging gives:
245: \begin{equation}
246: \frac{\partial \vec{z} }{\partial P_{ik} } =
247: \gamma \vec{K} \frac{\partial \vec{F}}{\partial P_{ik} } \vec{z}
248: \label{eq:diff2}
249: \end{equation}
250: where $\vec{K} = (\vec{I} - \gamma \vec{F})^{-1}$. %
251:
From Eq.~\ref{eq:rl-error} and Eq.~\ref{eq:diff2}, together with the
chain rule, we obtain the gradient of the expected reward $H$ with respect to the
policy parameters $P_{ik}$:
255: \begin{equation}
256: \frac{ \partial H } { \partial P_{ik} } =
257: \frac{ \partial H } { \partial \vec{z} }
258: \frac{ \partial \vec{z} } { \partial P_{ik} }
259: = \left\langle \vec{r}, \gamma \vec{K} \frac{\partial \vec{F}}
260: {\partial P_{ik}} \vec{z} \right\rangle
261: = \gamma \left\langle \vec{K}^*\vec{r}, \frac{\partial \vec{F}}
262: {\partial P_{ik}} \vec{z} \right\rangle
263: \label{eq:gradient}
264: \end{equation}
265: where $A^*$ means the adjoint operator of $A$ defined by $\langle u, A
266: w \rangle = \langle A^*u, w \rangle$. Let us define:
267: $\vec{q}=\vec{K}^*\vec{r}$. While $\vec{K}$ maps the initial state
268: $\vec{s}_0$ to the future expected state occupancy $\vec{z}$, its
269: adjoint, $\vec{K}^*$, maps the reward vector $\vec{r}$ back to
270: \emph{expected reward} $\vec{q}$. The value of $q_i$ represents the
271: (pointwise) expected reward in state
272: $s_i$ for policy $\vec{P}$.~%
273: \footnote{Indeed, this is a different way to define the traditional
274: \emph{value-function}. Note that generally, neither $\vec{r}$ nor
275: $\vec{q}$ are probabilities because their 1-norm is generally not 1.
276: }
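
For a real matrix the adjoint is simply the transpose, so the
definition $\vec{q}=\vec{K}^*\vec{r}$ can be unpacked as follows (a
short restatement that uses only the quantities defined above):
\[
(\vec{I} - \gamma \vec{F}^T)\, \vec{q} = \vec{r},
\qquad \mbox{i.e.} \qquad
q_i = r_i + \gamma \sum_j F_{ji} \, q_j .
\]
This is the adjoint counterpart of Eq.~\ref{eq:system}: the expected
reward in state $i$ is the immediate reward plus the discounted
expected reward of the successor states. The same adjoint identity
also gives an equivalent expression for the objective,
\[
H = \langle \vec{r}, \vec{K} \vec{s}_0 \rangle
= \langle \vec{K}^* \vec{r}, \vec{s}_0 \rangle
= \langle \vec{q}, \vec{s}_0 \rangle ,
\]
so $H$ can be evaluated either forwards through $\vec{z}$ or backwards
through $\vec{q}$.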
277:
278: Finally, differentiating Eq.~\ref{eq:projection} gives us
279: $\partial \vec{F} / \partial P_{ik}$. Inserting this into
280: Eq.~\ref{eq:gradient} yields:
281: \begin{equation}
282: G_{ik} = \frac{ \partial H } { \partial P_{ik} }
283: \propto \, z_i \! \sum_j T_{kji} q_j.
284: \label{eq:gradient-p}
285: \end{equation}
286: In words, the gradient of $H$ with respect to policy parameter
287: $P_{ik}$ (i.e. the probability of taking action $a_k$ in state $s_i$)
288: is proportional to the expected occupancy $z_i$ times the weighted sum
289: of expected reward $q_j$ over next states $(j=1,...,N)$ weighted by
290: the transition probabilities $T_{kji}$.
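
The intermediate step can be written out explicitly. The sketch below
writes the policy element with the action index first, $P_{ki}$, as in
Eq.~\ref{eq:projection}, and treats the elements of $\vec{P}$ as
independent variables, i.e. it ignores the normalization constraint on
the columns of $\vec{P}$. Differentiating
$F_{jl} = \sum_{k'} T_{k'jl} P_{k'l}$ gives
\[
\frac{\partial F_{jl}}{\partial P_{ki}} = T_{kji} \, \delta_{li},
\qquad \mbox{so that} \qquad
\left( \frac{\partial \vec{F}}{\partial P_{ki}} \vec{z} \right)_{\! j}
= \sum_l T_{kji} \, \delta_{li} \, z_l = T_{kji} \, z_i .
\]
Inserting this into Eq.~\ref{eq:gradient} yields
\[
\frac{\partial H}{\partial P_{ki}}
= \gamma \sum_j q_j \, T_{kji} \, z_i
= \gamma \, z_i \sum_j T_{kji} \, q_j ,
\]
which shows that, under these assumptions, the constant of
proportionality in Eq.~\ref{eq:gradient-p} is the discount factor
$\gamma$.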
291:
Note that the gradient could also have been approximated using finite
differences, which would need at least $1+n^2$ field
calculations.\footnote{ Finite difference approximation of the
derivative $\partial \vec{z} / \partial T_{ij}$ involves computing
$\vec{z}$ for $\vec{F}$, then perturbing a single $T_{ij}$ in
$\vec{F}$ by a tiny amount $dT$ and subsequently recomputing
$\vec{z}'$. The derivative is then approximated by $\partial \vec{z}
/ \partial T_{ij} \approx (\vec{z}'-\vec{z})/dT$. For an $n\times n$
matrix $\vec{F}$, one would need to repeat this for every element,
which requires a total of up to $1+n^2$ calculations of $\vec{z}$.}
The adjoint method is much more efficient and needs only \emph{two}
field calculations: one for $\vec{z}$ and one for $\vec{q}$.
304:
305: %\paragraph{Optimal control}
306: %The OC error has an additional quadratic term in $\vec{F}$
307: %\begin{equation}
308: %\frac{\partial} { \partial T_{ij} } =
309: %\sum_{t^*=0}^\infty t^* \gamma^{t^*} \vec{F}^{2t^*-1} \vec{s}_0
310: %\frac{\partial \vec{F}} { \partial T_{ij} }\vec{s}_0
311: %\end{equation}
312:
Once we have the gradient $\vec{G}$, improving policy $\vec{P}$ is
straightforward using gradient ascent, or we can use more
sophisticated gradient-based methods such as nonlinear conjugate
gradients (as in~\cite{baxter99direct}). The optimization is nonlinear
because $\vec{z}$ and $\vec{q}$ themselves depend on the current
estimate of $\vec{P}$.
319:
320: %========================================================================
321: \section{Computation of the optimal policy}
322: %========================================================================
323:
We will introduce two algorithms that incorporate our ideas of
gradient-based reinforcement planning. The first is an off-line
planning algorithm that finds the optimal policy but
assumes that the environment transition probabilities are known. The
second is an online version that could cope with unknown
environments.
330:
331: %----------------------------------------------------------------------
332: \subsection{Offline GREP}
333: %----------------------------------------------------------------------
334:
335: If the environment transition probabilities $T_{kji}$ are known, the
agent may improve its policy using GREP. Our offline GREP planning
algorithm consists of two steps:
338:
339: \begin{enumerate}
\item {\em Plan ahead:} Compute the policy gradient $\vec{G}$ in
Eq.~\ref{eq:gradient-p} and improve the current policy
342: \begin{equation}
343: \vec{P} \leftarrow \vec{P} + \alpha \vec{G}
344: \label{eq:t-update}
345: \end{equation}
346: where $\alpha$ is a suitable step size parameter; for efficiency we
347: can also perform a linesearch on $\alpha$.
\item {\em Evaluate policy:} Repeat the above until the policy no longer improves.
349: \end{enumerate}
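
The gradient step above does not by itself keep the columns of
$\vec{P}$ normalized. One simple way to restore normalization, given
here only as a sketch and assuming that all updated entries stay
nonnegative (which holds, for instance, when $\vec{r} \ge 0$, since
then $\vec{q}$, $\vec{z}$ and hence $\vec{G}$ are nonnegative), is to
rescale each column after the step:
\[
P_{ki} \leftarrow
\frac{ P_{ki} + \alpha \, G_{ki} }
{ \sum_{k'} \left( P_{k'i} + \alpha \, G_{k'i} \right) } .
\]
Here $P_{ki}$ and $G_{ki}$ are indexed with the action first, as in
Eq.~\ref{eq:projection}. The denominator is at least one under the
stated assumption, so the rescaling is always well defined. Other
choices, such as projecting onto the probability simplex or using a
Boltzmann reparameterization, are equally possible and are not
examined here.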
350:
Matrix $\vec{P}$ describes a probabilistic policy. We define the
\emph{maximum probable policy} (MPP) to be the deterministic policy obtained by
taking the most probable action at each state. It is not obvious
that the MPP will converge to the globally optimal solution, but
we expect the MPP at least to be near-optimal.
356:
357: %========================================================================
358: \subsubsection*{Numerical experiments}
359: %========================================================================
360:
361: \begin{figure} \centering
362: \includegraphics[width=0.147\textwidth]{maze.eps}
363: % \includegraphics[width=0.155\textwidth]{z.epsi}
364: % \includegraphics[width=0.155\textwidth]{q.epsi}
365: \includegraphics[width=0.155\textwidth]{zt.eps}
366: \includegraphics[width=0.155\textwidth]{qt.eps}
367: \caption{Left: $10\times10$ toy maze with start at left and goal at right side.
368: Center: plot of expected occupancy $\vec{z}$. Right: plot of
369: expected reward $\vec{q}$. White corresponds to higher probability.
370: [Blurring is due to visualisation only]. }
371: \label{fig:maze}
372: \end{figure}
373:
374: \begin{figure*} \centering
375: \includegraphics[width=0.45\textwidth]{pol1}\hfill
376: \includegraphics[width=0.45\textwidth]{pol2}
377: \caption{%
378: Plot of simulated path length versus GREP iteration of a small toy
379: MDP maze for the probability weighted (PW) policy, annealed PW
380: policy and MPP policy. The shortest path to goal is 14. Left:
381: starting from initial uniform policy. Right: starting from initial
382: random policy. }
383: \label{fig:reward}
384: \end{figure*}
385:
386: We performed some numerical experiments using offline GREP. Our test
387: problem was a pure planning task in a $10\times10$ toy maze (see
388: Fig.~\ref{fig:maze}) where the probabilistic policy $\vec{P}$
389: represents the probability of taking a certain action at a certain
390: maze position. The same figure also shows typical solutions for the
391: quantities $\vec{z}$ and $\vec{q}$, i.e. the expected occupancy and
392: expected reward respectively (for certain $\vec{P}$).
393:
394: After each GREP iteration, i.e. after each gradient calculation and
395: $\vec{P}$ update, we checked the obtained policy by running 20
396: simulations using the current value of $\vec{P}$. The probability
397: weighted (PW) policy selects action $k$ at state $i$ proportional to
398: $P_{ik}$, while the annealed PW policy uses an annealing factor of
399: $T=4$; we also simulated the MPP solution. Figure~\ref{fig:reward}
400: shows the average simulated path length versus GREP iteration of the
PW, the annealed PW policy and the derived MPP policy. In the left
plot the initial policy $\vec{P}$ was taken to be uniform. The right plot in
the same figure shows the simulated path lengths starting from a random policy;
here, too, the MPP finds the optimal solution, but slightly later.
405:
We see from the figure that in both cases the probability-weighted
(PW) policy improves during the GREP iterations. However, the
convergence is very slow, which indicates the severe non-linearity of the
problem. The annealed PW policy performs better than PW. Finally, we see
that the MPP finds the optimal solution within a few iterations.
Using Dijkstra's method, we confirmed that the MPP policy found was in
agreement with the global shortest path solution.
413:
414: %----------------------------------------------------------------------
\subsection{Online GREP}
416: %----------------------------------------------------------------------
The account below describes an idea for using GREP when the environment is
not known beforehand.\footnote{At the time of writing, we have not
implemented this idea yet.} The steps interleave ``Kalman
filter''-like estimation of the unknown environment transition
probabilities with the explicit planning of GREP. They also
include a step to estimate a possibly unknown (linear) sensor
mapping. Apart from the policy matrix $\vec{P}$, we also need to estimate
the (environment) transition probabilities $T_{kji}$ and possibly the
sensor matrix $\vec{B}$. We can optimize for all parameters by
iteratively ascending to their conditional mode. The conditional
maximization steps are easy:
428: %
429: \begin{enumerate}
\item {\em Plan ahead:} Compute the policy gradient $\vec{G}$ in
Eq.~\ref{eq:gradient-p} and improve the current policy
\begin{equation}
\vec{P} \leftarrow \vec{P} + \alpha \vec{G}
\label{eq:t-update-online}
\end{equation}
where $\alpha$ is a suitable step size parameter; for efficiency we
can also perform a linesearch. Afterwards, we need to renormalize the
columns of $\vec{P}$. See note below on policy regularization.
439:
\item {\em Select action:} Given state estimate $\vec{s}_t$, draw an
action $k$ from the policy according to
\[ k \sim \vec{P}_{\!t} \, \vec{s}_t, \]
then receive reward $R$ and estimate the new state $\vec{s}_{t+1}$.
444:
445: %%^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
446: %% IVO: leaving out stuff below because not directly relevant in
447: %% THIS paper. Very interesting though; NEXT paper?!
448: %%^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
449: \item {\em Estimate state:} Observe $\vec{y}_{t+1}$ and estimate
450: $\vec{s}_{t+1}$ using
451: \begin{equation}
452: p(\vec{s}_{t+1} | \vec{y}_{t+1}, \vec{s}_{t}, k )
453: \; \propto \;
454: p(\vec{y}_{t+1} | \vec{s}_{t+1}) \,
455: p(\vec{s}_{t+1} | \vec{F}_k, \vec{s}_{t} ).
456: \end{equation}
457: Assuming Gaussian noise for observations and state estimates we
458: obtain:
459: \begin{equation}
460: \vec{s}_{t+1} = \frac{
\vec{H}_s \vec{F}_k \vec{s}_{t} +
462: \vec{B}^{-1} \vec{H}_y \vec{y}_{t+1} }
463: { \vec{H}_s + \vec{B}^{-1} \vec{H}_y \vec{B}^{-T} }
464: \end{equation}
465: where $\vec{B}$ is the \emph{sensor matrix} that maps internal state
$\vec{s}$ to observations $\vec{y}$. Matrices $\vec{H}_s$ and
$\vec{H}_y$ are the inverse covariances (the so-called
\emph{precisions}) of state $\vec{s}$ and observation $\vec{y}$,
respectively.
470:
\item {\em Estimate sensors:} In case the sensor matrix $\vec{B}$ is
also unknown, we have to perform an additional estimation step for
$\vec{B}$. This is a common step in the standard Kalman formulation.
474:
\item {\em Estimate environment:} Given action $k$ and reward $R$, we
update the reward vector
\begin{equation}
r_j \leftarrow R,
\end{equation}
where $j$ is the current state, i.e. $\vec{s}_t = j$. For the
environment transition probabilities we use the correction
\begin{equation}
\dd \vec{F}_k \;\propto\; (\vec{s}_{t+1} - \vec{F}_k \vec{s}_t )
\vec{s}_t^T (\vec{s}_t \vec{s}_t^T)^{-1},
\end{equation}
or, if $(\vec{s}_t \vec{s}_t^T)^{-1}$ does not exist, we can use
\begin{equation}
\dd \vec{F}_k \;\propto\; (\vec{s}_{t+1} - \vec{F}_k \vec{s}_t )
(\vec{s}_t^T \vec{s}_t)^{-1} \vec{s}_t^T,
\end{equation}
and reestimate the environment transition probabilities as
\begin{equation}
\vec{T}_{k} \leftarrow
\vec{T}_{k} + (\vec{s}_{t+1} - \vec{T}_k \vec{s}_t ) \vec{s}_t^T.
\end{equation}
After the update, one should set entries of the updated matrix that
correspond to physically impossible transitions to zero.
Afterwards, we need to renormalize the columns of $\vec{T}_k$. It is
important to note that, given $\vec{s}_{t}$ and $\vec{s}_{t+1}$,
transition matrix $\vec{T}_k$ is conditionally independent of the
policy $\vec{P}$. That is to say, we can obtain an accurate model of
the environment using, e.g., just a random walk.
\item Repeat from step 1.
504: \end{enumerate}
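
To see what the update of $\vec{T}_k$ in the last estimation step does,
consider the special case of fully observed states, so that $\vec{s}_t$
and $\vec{s}_{t+1}$ are unit vectors. We write $\vec{s}_t = \vec{e}_i$
and $\vec{s}_{t+1} = \vec{e}_j$, where the notation $\vec{e}_i$ for the
$i$-th unit vector is introduced only for this remark. Then
\[
\vec{T}_k' = \vec{T}_k + (\vec{e}_j - \vec{T}_k \vec{e}_i)\, \vec{e}_i^T
\quad \Longrightarrow \quad
\vec{T}_k' \vec{e}_i = \vec{e}_j,
\qquad
\vec{T}_k' \vec{e}_l = \vec{T}_k \vec{e}_l \;\; (l \neq i),
\]
i.e. column $i$ is overwritten by the single observed transition and
all other columns are left unchanged. A damped variant with a
hypothetical learning rate $\eta \in (0,1]$, which is our assumption
and not part of the scheme above, would instead give
\[
\vec{T}_k' \vec{e}_i = (1-\eta)\, \vec{T}_k \vec{e}_i + \eta\, \vec{e}_j ,
\]
a convex combination of two probability vectors, so that column $i$
remains normalized and accumulates evidence over repeated visits.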
505:
To summarize what is happening: in the planning stage, based
on the current (and maybe not accurate) environment model, the agent
tries to improve its current policy by planning ahead using the
gradient in Eq.~\ref{eq:gradient-p}. Remember that the gradient
involves simulating paths from the current state and adjoint paths
from the goal. In the action stage the agent samples an action from
its policy. Then the agent senses the new state and updates its
environment model using this new information. Notice that policy
improvement is not done ``backwards'', as is traditionally done in DP
methods, but ``forward'' by planning ahead.
516:
517: %========================================================================
518: \section{Conclusions}
519: %========================================================================
520:
521: \subsubsection*{Future topics}
We have tacitly assumed that $\vec{z}$ and $\vec{q}$ are computed
using the same discount factor $\gamma$. However, we could introduce
separate parameters $\gamma_z$ and $\gamma_q$, which effectively
assign a ``forward time window'' for $\vec{z}$ and a
``backward time window'' for $\vec{q}$. In fact, when $\gamma_z
\rightarrow 0$ we have a ``one-step-look-ahead''. Alternatively, in
the limit of $\gamma_q \rightarrow 0$ we obtain a gradient for a
greedy policy that maximizes only ``immediate reward''. How both
parameters affect GREP's performance is a topic for future research.
531:
The above suggests that GREP can be viewed as a generalization of
``one-step-look-ahead'' policy improvement. In fact, a
``one-step-look-ahead'' improvement rule can be obtained for
$\gamma_z \rightarrow 0$ simply by taking $\vec{z}=\vec{s}_t$ in
Eq.~\ref{eq:gradient-p}. Such an approach would be ``policy greedy''
in the sense that it updates the policy only locally. We expect GREP to
perform better because it updates the policy more globally; whether
this in fact improves GREP is also a remaining issue for future
research.
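
The two limits can be made explicit. As a sketch, assume that the
separate discount factors enter as
$\vec{z} = (\vec{I} - \gamma_z \vec{F})^{-1} \vec{s}_0$ and
$\vec{q} = (\vec{I} - \gamma_q \vec{F}^T)^{-1} \vec{r}$, which is our
reading of the proposal rather than a definition given above. Then
\[
\gamma_z \rightarrow 0: \;\; \vec{z} \rightarrow \vec{s}_0,
\qquad
\gamma_q \rightarrow 0: \;\; \vec{q} \rightarrow \vec{r},
\]
and the right hand side of Eq.~\ref{eq:gradient-p} reduces to
\[
s_{0,i} \sum_j T_{kji} \, q_j
\qquad \mbox{and} \qquad
z_i \sum_j T_{kji} \, r_j ,
\]
respectively. In the first case only the states occupied right now
contribute to the update; in the second case only the reward of the
immediate successor states is taken into account.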
541:
542: The interleaving of GREP with a Kalman-like estimation procedure of
543: the environment could handle a variety of interesting problems such as
544: planning in POMDP environments.
545:
We must mention that an appropriate reparameterization of the stochastic
policy, e.g. using a Boltzmann distribution, could improve the
convergence. We have not pursued this further.
549:
550: \subsubsection*{Summary}
551: We have introduced a learning method called ``gradient-based
552: reinforcement planning'' (GREP). GREP needs a model of the
553: environment and plans ahead to improve its policy \emph{before} it
554: actually acts in the environment. We have derived formulas for the
555: exact policy gradient.
556:
557: Numerical experiments suggest that the probabilistic policy indeed
558: converges to an optimal policy---but quite slowly. We found that (at
559: least in our toy example) the optimal solution can be found much
560: faster by annealing or simply by taking the most probable action at
561: each state.
562:
Further work will be to incorporate GREP in online RL tasks
where the environment parameters, i.e. transition probabilities
$T_{kji}$, are unknown and have to be learned. While an analytical
solution for $\vec{q}$ and $\vec{z}$ is only viable for small problem
sizes, for larger problems we probably need to investigate Monte Carlo
or DP methods.%
569:
570: %========================================================================
571: %\bibliographystyle{plain}
572: %\bibliographystyle{mlapa}
573: {\small
574: %\bibliography{grep}
575: \begin{thebibliography}{}
576:
577: \bibitem[Baird, 1998][Baird][1998]{baird98gradient}
578: Baird, L.~C. (1998).
579: \newblock Gradient descent for general reinforcement learning.
580: \newblock {\em Advances in Neural Information Processing Systems}.
581: \newblock {MIT} Press.
582:
583: \bibitem[Baxter \& Bartlett, 1999][Baxter and Bartlett][1999]{baxter99direct}
584: Baxter, J., \& Bartlett, P. (1999).
585: \newblock {\em Direct gradient-based reinforcement learning: I. gradient
586: estimation algorithms} (Technical Report).
587: \newblock Research School of Information Sciences and Engineering, Australian
588: National University.
589:
590: \bibitem[Difilippo et~al.\/, 1996][Difilippo et~al.\/][1996]{difilippo}
591: Difilippo, F.~C., Goldstein, M., Worley, B.~A., \& Ryman, J.~C. (1996).
592: \newblock Adjoint {M}onte {C}arlo methods for radiotherapy treatment planning.
593: \newblock {\em Trans. Am. Nucl. Soc.}, {\em 74}, 14--16.
594:
595: \bibitem[Ng et~al.\/, 1999][Ng et~al.\/][1999]{ng99policy}
596: Ng, A., Parr, R., \& Koller, D. (1999).
597: \newblock Policy search via density estimation.
598: \newblock {\em Advances in Neural Information Processing Systems}.
599: \newblock {MIT} Press.
600:
601: \bibitem[Schmidhuber, 1990][Schmidhuber][1990]{Schmidhuber:90sandiego}
602: Schmidhuber, J. (1990).
603: \newblock An on-line algorithm for dynamic reinforcement learning and planning
604: in reactive environments.
605: \newblock {\em Proc. IEEE/INNS International Joint Conference on Neural
606: Networks, San Diego} (pp.\/ 253--258).
607:
608: \bibitem[Sutton \& Barto, 1998][Sutton and Barto][1998]{suttonbarto}
609: Sutton, R.~S., \& Barto, A.~G. (1998).
610: \newblock {\em Reinforcement learning. {A}n introduction}.
611: \newblock {MIT} Press, Cambridge.
612:
613: \bibitem[Sutton et~al.\/, 2000][Sutton et~al.\/][2000]{sutton00policy}
614: Sutton, R.~S., McAllester, D., Singh, S., \& Mansour, Y. (2000).
615: \newblock Policy gradient methods for reinforcement learning with function
616: approximation.
617: \newblock {\em Advances in Neural Information Processing Systems}.
618: \newblock {MIT} Press.
619:
620: \bibitem[Williams, 1992][Williams][1992]{williams92simple}
621: Williams, R.~J. (1992).
622: \newblock Simple statistical gradient-following algorithms for connectionist
623: reinforcement learning.
624: \newblock {\em Machine Learning}, {\em 8}, 229--256.
625:
626: \end{thebibliography}
627:
628: }
629:
630:
631: %========================================================================
632: %========================================================================
633: %========================================================================
634:
635:
636: %========================================================================
637: \section*{Appendix A: Implicit policies}
638: %========================================================================
In deterministic environments where the state-action pair $(s_i,a_m)$
uniquely leads to a state $s_j$, i.e. $T_{kji} = \delta_{km}
(k=1,...,K)$, the projection $\vec{F}$ is solely determined
by the policy $\vec{P}$, and \emph{vice versa}. We refer to this as
the case of \emph{implicit policy} because the policy is implicitly
implied in the induced transition probability $F_{ji}$.

In such environments it suffices to solve for $\vec{F}$ directly
and omit the parameterization through $\vec{P}$. From
Eq.~\ref{eq:projection} we see that
\begin{equation}
P_{mi} = F_{ji}
\end{equation}
and using a similar derivation as we have done for $\partial H /
\partial P_{ik}$, it can be shown that the gradient of $H$ with
respect to $\vec{F}$ is given by
\begin{equation}
\vec{G} \propto \vec{q} \vec{z}^T.
\label{eq:rank-one}
\end{equation}
One caveat is important. In most cases many elements
$T_{ji}$ are zero, representing an absent transition between $s_i$ and
$s_j$. Naively updating $\vec{F}$ with the full gradient $\vec{G}$
would fill in $\vec{F}$ completely, which is in most cases
undesirable or even physically incorrect. Therefore, one must check the
gradient at each update and set the entries that correspond to
impossible transitions to zero. We refer to this ``heuristically
corrected'' gradient as $\widetilde{\vec{G}}$. In addition, the columns
of $\vec{F}$ have to be renormalized after each update. The rank-one
update in Eq.~\ref{eq:rank-one} is interesting because it provides an
efficient means of calculating the inverse in Eq.~\ref{eq:occupancy}.
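
As a sketch of these remarks, assume a step size $\alpha$ and write
$\vec{B} = \vec{I} - \gamma\vec{F}$, reading Eq.~\ref{eq:occupancy} as
$\vec{z} = \vec{B}^{-1}\vec{s}_0$. The symbols $\alpha$, $\vec{B}$ and
the primed quantities are introduced here only for illustration. The
corrected update followed by column renormalization can then be written
as
\begin{equation}
  \widetilde{G}_{ji} = \left\{
  \begin{array}{ll}
    G_{ji} & \mbox{if the transition from $s_i$ to $s_j$ is possible,} \\
    0      & \mbox{otherwise,}
  \end{array} \right.
  \qquad
  F'_{ji} = \frac{F_{ji} + \alpha\,\widetilde{G}_{ji}}
                 {\sum_{l} \left( F_{li} + \alpha\,\widetilde{G}_{li} \right)},
\end{equation}
assuming that the updated entries stay nonnegative and that no column
sum vanishes. For the uncorrected rank-one update $\vec{F}' = \vec{F} +
\alpha\,\vec{r}\vec{z}^T$ without renormalization, the standard
Sherman--Morrison identity gives
\begin{equation}
  \vec{z}' = \left( \vec{B} - \alpha\gamma\,\vec{r}\vec{z}^T \right)^{-1} \vec{s}_0
  = \vec{z} + \frac{ \alpha\gamma \, (\vec{z}^T\vec{z}) }
                   { 1 - \alpha\gamma \, \vec{z}^T \vec{B}^{-1} \vec{r} }
    \, \vec{B}^{-1} \vec{r},
\end{equation}
provided that the denominator is nonzero. The masked gradient
$\widetilde{\vec{G}}$ is in general no longer of rank one, so this
shortcut does not carry over unchanged to the corrected update.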
670:
671: %% [IVO: Hmm... not proven.... leave out??!!]
672: %%
673: %Then, writing $\vec{F}' = \vec{F} + \alpha
674: %\vec{r}\vec{z}^T$ and $\vec{B} = \vec{I} - \gamma \vec{F}$ we have:
675: %\begin{eqnarray}
676: % \vec{z}'
677: % &=& ( \vec{I} - \gamma \vec{F}' )^{-1} \vec{s}_0
678: % \; = \; ( \vec{B} - \beta \vec{r} \vec{z}^T )^{-1} \vec{s}_0 \nonumber \\
679: % &=& \vec{z} + \frac{ \vec{B}^{-1} \widetilde{\vec{G}} }
680: % { \beta^{-1} - \mbox{tr}\,\vec{B}^{-1} \widetilde{\vec{G}} } \, \vec{z}
681: %\end{eqnarray}
682: %where we used the rank-one update formula for the inverse:
683: %\begin{equation}
684: % \left(\vec{F} + \vec{u} \vec{v}^T \right)^{-1} =
685: % \vec{F}^{-1} - \frac{ \vec{F}^{-1} \vec{u} \vec{v}^T \vec{F}^{-1} }
686: % { 1 + \vec{v}^T \vec{F}^{-1} \vec{u} }
687: %\end{equation}
688: %provided that $1 + \vec{v}^T \vec{F}^{-1} \vec{u} \neq 0$.
689:
690:
691: %========================================================================
692: \section*{Appendix B: Monte Carlo gradient sampling}
693: %========================================================================
694:
In our example, we calculated $\vec{z}$ and $\vec{q}$ in
Eq.~\ref{eq:occupancy} by linear programming. For large state spaces
the matrix inversion quickly becomes too computationally intensive, and
traditional dynamic-programming-based methods would probably be more
efficient.
700:
Instead, we investigated the use of Monte Carlo (MC) simulation. We use
\emph{forward sampling} to approximate the expected state occupancies
in $\vec{z}$ and so-called \emph{adjoint Monte Carlo
sampling}~\cite{difilippo} to estimate the adjoint reward $\vec{q}$.
Adjoint MC simulation is far more efficient than estimating
each $q_i$ by a separate MC run.%
\footnote{By one MC run we mean performing, e.g., 10000 trials from
  a fixed state.} %
By performing the simulation backward from $\vec{r}$, we obtain all
values of $\vec{q}$ using only a single MC run.
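
To see why a single backward run can suffice, assume, as a sketch, that
Eq.~\ref{eq:occupancy} reads $\vec{z} = (\vec{I} -
\gamma\vec{F})^{-1}\vec{s}_0$ and that $\vec{q}$ solves the
corresponding transposed (adjoint) system. For $0 \le \gamma < 1$ and a
column-stochastic $\vec{F}$, both inverses expand into Neumann series,
\begin{equation}
  \vec{z} = \sum_{t=0}^{\infty} \gamma^t \vec{F}^t \vec{s}_0,
  \qquad
  \vec{q} = \sum_{t=0}^{\infty} \gamma^t \left( \vec{F}^T \right)^t \vec{r},
\end{equation}
so that $\vec{r}^T\vec{z} = \vec{q}^T\vec{s}_0$. The time index $t$ is
notation introduced here for illustration. Forward trajectories started
from $\vec{s}_0$ sample the terms of the first sum, whereas trajectories
propagated backward from $\vec{r}$ sample the terms of the second sum
for all states at once.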
711:
712: \begin{figure} \centering
713: \includegraphics[height=0.28\textwidth]{f6} \hfill
714: \includegraphics[height=0.28\textwidth]{f6a} \hfill
715: \includegraphics[height=0.28\textwidth]{g6} \\
716: \hspace{24mm} (a) \hfill (b) \hfill (c) \hspace{20mm}
717: \caption{ \it MC calculations of a $6\times6$ toy maze. The agent is
718: at (0,1) and targets (5,4). (a) Expected occupancy, (b) adjoint
719: probability, and (c) normalized policy gradient for $n=4$, $n=40$,
720: $n=200$. Each vector is computed as $\vec{v} = \sum_k P_{ik}
721: \vec{e}_k$ where $\vec{e}_k$ is the unit vector along the state
722: change induced by action $a_k$.}
723: \label{fig:mc-gradientplots}
724: \end{figure}
725:
Fig.~\ref{fig:mc-gradientplots} shows the MC approximations of
$\vec{z}$ and $\vec{q}$. On the right of the same figure, we have
plotted the policy gradient computed from MC estimates using a
minimum of $n=\{20, 40, 200\}$ samples. To compare them with
the exact gradient, we calculated the exact values of $\vec{z}$ and
$\vec{q}$ by inverting the linear system. For larger numbers of
samples, the gradient vectors do indeed point more strongly towards the
goal.
734:
An important feature of general Monte Carlo methods is that they
automatically concentrate their sampling on the important regions of
the parameter space, mostly in proportion to the posterior or the
likelihood. For our purpose of sampling the gradient, we tried to
apply \emph{annealing} in order to concentrate the sampling density
even more on the regions of large gradient values. To sample from a
density $p(\theta)$ we may sample from the annealed function
$p_\gamma(\theta) = p(\theta)^\gamma / \int p(\theta)^\gamma d\theta$
and reweight each sample with its importance weight
$1/p_\gamma(\theta)$. For $\gamma \rightarrow \infty$, the set of
samples converges to the most probable gradient.
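
The reweighting can be sketched with the standard importance-sampling
identity. The integrand $f$, the test function $g$, the sample size $N$
and the samples $\theta_n$ are notation assumed here for illustration.
For any integrable $f$ and samples $\theta_n$ drawn from $p_\gamma$,
with $p_\gamma(\theta) > 0$ wherever $f(\theta) \neq 0$,
\begin{equation}
  \int f(\theta) \, d\theta
  = \int \frac{f(\theta)}{p_\gamma(\theta)} \, p_\gamma(\theta) \, d\theta
  \approx \frac{1}{N} \sum_{n=1}^{N} \frac{f(\theta_n)}{p_\gamma(\theta_n)}.
\end{equation}
Choosing $f(\theta) = g(\theta)\,p(\theta)$ recovers an expectation of
$g$ under the original density $p$, in which case each sample carries
the weight $p(\theta_n)/p_\gamma(\theta_n)$. For $\gamma = 1$ these
weights are all equal to one and the estimate reduces to plain MC
averaging.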
746:
In conclusion, our approach of separately estimating $\vec{q}$ and
$\vec{z}$ using MC and \emph{then} multiplying their solutions
elementwise did not bring clear advantages. If we could sample
from the joint distribution $q_i z_i$ (i.e.\ the elementwise product), then
MC would clearly turn out to be a very efficient method.
752:
753: \end{document}
754:
755:
756:
757:
758:
759: