
1: % Extended version for Tech report

2: %

3:

4: \documentclass[a4paper]{article}

5: \usepackage[dvips]{graphicx}

6: \usepackage{techrep}

7: \usepackage{mlapa}

8:

9: \renewcommand{\vec}[1]{{\bf#1}}

10: \newcommand{\new}{{\mbox{\tiny new}}}

11: \newcommand{\dd}{\mbox{d}}

12:

13:

14: \author{Ivo Kwee \hspace{8mm} Marcus Hutter \hspace{8mm}

15:   J\"{u}rgen Schmidhuber \\

16: %  Istituto Dalle Molle sull'Intelligenza Artificiale \\

17:   {\normalsize IDSIA, Manno CH-6928, Switzerland.}  \\

18:   {\normalsize \tt \{ivo,marcus,juergen\}@idsia.ch} }

19: \title{Gradient-based Reinforcement Planning in Policy-Search Methods

20:   \footnote{ This is an extended version of the paper presented at the

21:     EWRL 2001 in Utrecht (The Netherlands). In this technical report,

22:     the derivation steps are presented with more detail, more

23:     footnotes, appendices and more (unfinished) ideas. } }

24: \date{November 2001}

25:

26: \reportnumber{14-01}

27: \reportaddnote{This work was supported by SNF grants 21-55409.98 and

28:   2000-61847.00 }

29:

30: \begin{document}

31: \makecover

32: \maketitle

33:

34: %\twocolumn[%\vspace{-3ex}

35: %Proceedings of the ICML-2001

36: %\ewrltitle{Gradient-based Reinforcement Planning in Policy-Search Methods}

37: %\ewrlauthor{Ivo Kwee}{ivo@idsia.ch}

38: %\ewrlauthor{Marcus Hutter}{marcus@idsia.ch}

39: %\ewrlauthor{Juergen Schmidhuber}{juergen@idsia.ch}

40: %\ewrladdress{Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA),

41: %             Galleria 2, CH-6928 Manno, Switzerland}

42: %\vskip 0.25in

43: %]

44:

45: \begin{abstract}

  We introduce a learning method called ``gradient-based reinforcement
  planning'' (GREP). Unlike traditional dynamic programming (DP) methods,
  which improve their policy backwards in time, GREP is a gradient-based
  method that plans ahead and improves its policy \emph{before} it
  actually acts in the environment. We derive formulas for the exact
  policy gradient of the expected future reward and confirm our ideas
  with numerical experiments.

53: \end{abstract}

54:

55:

56: %========================================================================

57: \section{Introduction}

58: %========================================================================

59:

It has been shown that \emph{planning} can dramatically improve
convergence in reinforcement learning
(RL)~\cite{Schmidhuber:90sandiego,suttonbarto}. However, most RL
methods proposed so far that explicitly use planning are value-based
(or $Q$-value-based) methods, such as \emph{Dyna-Q} or
\emph{prioritized sweeping}.

66:

Recently, much attention has been directed to so-called
\emph{policy-gradient} methods, which improve their policy directly by
calculating the derivative of the expected future reward with respect
to the policy parameters. Gradient-based methods are believed to be
more advantageous than value-function-based methods in huge state
spaces and in POMDP settings. Probably the first gradient-based RL
formulation is the class of REINFORCE algorithms of
Williams~\cite{williams92simple}. More recent methods include,
e.g.,~\cite{baird98gradient,baxter99direct,sutton00policy}. Our
derivation of the gradient has the flavor of~\cite{ng99policy}, who
derive the gradient using future state probabilities.

79:

Our novel contribution in this paper is to combine gradient-based
learning with explicit planning.  We introduce ``gradient-based
reinforcement planning'' (GREP), which improves a policy \emph{before}
actually interacting with the environment. We derive the formulas for
the exact policy gradient and confirm our ideas in numerical
experiments. GREP learns the action probabilities of a probabilistic
policy for discrete problems. While we illustrate GREP with a
small MDP maze, it may also be used for the hidden state in POMDPs.

88:

89: %A key feature of our method is that we use so-called

90: %\emph{adjoint Monte Carlo sampling} which enables us to compute the

91: %gradient of the reward function with respect to policy variables in an

92: %efficient way.

93:

94: %========================================================================

95: \section{Derivation of the policy gradient}

96: %========================================================================

97:

98: %----------------------------------------------------------------------

\subsubsection*{Projection matrix}

100: %----------------------------------------------------------------------

Let us denote the discrete space of \emph{states} by $\mathcal{S} =
\{1,...,N\}$. Our belief on $\mathcal{S}$ is described by a probability
vector $\vec{s}$ of which each element $s_i$ represents the
probability of being in state $i$. We also define a set of actions
$\mathcal{A} = \{1,...,K\}$. The stochastic policy $\mathcal{P}$ is
represented by a matrix $\vec{P}:N \times K$ with elements $P_{ik} =
p(a_k|s_i)$, i.e. the conditional probability of action $k$ in
state $i$.%
\footnote{For computational reasons, one often reparameterizes the
  policy using a Boltzmann distribution. In this paper the
  probability $p(a_k|s_i)$ is given directly by $P_{ik}$ and we do not
  use a reparameterization, in order to keep the analysis clear.} %
Furthermore, let the environment $\mathcal{E}$ be defined by transition
matrices $\vec{T}_k$ ($k=1,...,K$) with elements $T_{kji} =
p(s_j|s_i,a_k)$, i.e. the probability of a transition to $s_j$ from
state $s_i$ given action $k$.

Now we define the \emph{projection matrix} $\vec{F}$ with elements
\begin{equation}
  F_{ji} = \sum_k T_{kji} \, P_{ik}.
\label{eq:projection}
\end{equation}
It is important to note that the matrix $\vec{F}$ does \emph{not}
model the transition probabilities of the environment itself, but the
transition probabilities induced by following policy $\mathcal{P}$ in
environment $\mathcal{E}$. The induced transition probability $F_{ji}$
is a weighted sum over actions $k$ of the transition probabilities
$T_{kji}$, with the policy parameters $P_{ik}$ as the weights.
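
As a small illustration (a hypothetical example, not one of the
experiments reported here), consider $N=2$ states and $K=2$ actions,
where action $1$ (``stay'') leaves the state unchanged and action $2$
(``move'') leads to the other state. Suppose the policy moves with
probability $p$ in state $1$ and always stays in state $2$. Writing
the policy with rows indexed by state and columns indexed by action,
and the transition matrices with rows indexed by the next state, we
have
\[
  \vec{T}_1 = \left( \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right),
  \quad
  \vec{T}_2 = \left( \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array} \right),
  \quad
  \vec{P} = \left( \begin{array}{cc} 1-p & p \\ 1 & 0 \end{array} \right),
\]
and summing over the two actions as in Eq.~\ref{eq:projection} gives
\[
  \vec{F} = \left( \begin{array}{cc} 1-p & 0 \\ p & 1 \end{array} \right).
\]
Each column of $\vec{F}$ sums to one, so $\vec{F}$ maps probability
vectors to probability vectors, and state $2$ is absorbing under this
policy.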

129:

130:

131: %----------------------------------------------------------------------

132: \subsubsection*{Expected state occupancy}

133: %----------------------------------------------------------------------

134: Using the projection matrix $\vec{F}$, states $\vec{s}_{t}$ and

135: $\vec{s}_{t+1}$ are related as $\vec{s}_{t+1} = \vec{F} \vec{s}_{t}$

136: and therefore $\vec{s}_{t} = \vec{F}^t \vec{s}_0$, where $\vec{s}_0$ is

137: the state probability distribution at $t=0$. We can now define the

138: \emph{expected state occupancy} as

139: \begin{equation}

140:   \vec{z} = E[\vec{s}|\vec{s}_0]

141:   = \sum_{t=0}^\infty \gamma^t \vec{s}_t

142:   = \sum_{t=0}^\infty (\gamma \vec{F})^t \vec{s}_0

143:   = (\vec{I} - \gamma \vec{F})^{-1} \vec{s}_0

144: \label{eq:occupancy}

145: \end{equation}

where $0 \le \gamma < 1$ is a discount factor that keeps the sum finite.  In the
last step, we have recognized the sum as the Neumann series of
the inverse. Notice that $\vec{z}$ is the solution of the linear
equation

150: \begin{equation}

151:   (\vec{I} - \gamma \vec{F}) \vec{z} =  \vec{s}_0

152: \label{eq:system}

153: \end{equation}

154: which is just the familiar Bellman equation for the expected occupancy

155: probability $\vec{z}$.
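
For instance, consider a hypothetical two-state chain (an illustrative
example, not part of the experiments) in which state $2$ is absorbing
and state $1$ is left with probability $p$ per step, started in state
$1$:
\[
  \vec{F} = \left( \begin{array}{cc} 1-p & 0 \\ p & 1 \end{array} \right),
  \qquad
  \vec{s}_0 = \left( \begin{array}{c} 1 \\ 0 \end{array} \right).
\]
Written out, the linear system for $\vec{z}$ reads $(1-\gamma+\gamma p)\,
z_1 = 1$ and $-\gamma p \, z_1 + (1-\gamma)\, z_2 = 0$, so that
\[
  z_1 = \frac{1}{1-\gamma+\gamma p},
  \qquad
  z_2 = \frac{\gamma p}{(1-\gamma)(1-\gamma+\gamma p)} .
\]
The elements of $\vec{z}$ sum to $\sum_t \gamma^t = 1/(1-\gamma)$,
because every $\vec{s}_t$ sums to one. With $\gamma = 0.9$ and $p =
0.5$ this gives $\vec{z} \approx (1.82, 8.18)^T$.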

156:

157: %%[IVO: huh??!!]

158: %Let us also define

159: %\begin{equation}

160: %  \mathcal{K}(\vec{F}) \,|\, : \vec{s}_0 \mapsto \vec{z}

161: %\end{equation}

162: %as the Bellman operator that maps and

163: %\begin{equation}

164: %  \mathcal{K}^*:  \vec{z} \mapsto (\vec{F},\vec{s}_0)

165: %\end{equation}

166:

167: %----------------------------------------------------------------------

168: \subsubsection*{Expected reward function}

169: %----------------------------------------------------------------------

170: In \emph{reinforcement learning} (RL) the objective is to maximize

171: future reward. We define a reward vector $\vec{r}$ in the same domain

172: as $\vec{s}$. Using the expected occupancy $\vec{z}$ the future

173: expected reward $H$ is simply

174: \begin{equation}

175:   H = \langle\vec{r}, \vec{z} \rangle

176: \label{eq:rl-error}

177: \end{equation}

where $\langle \cdot,\cdot \rangle$ denotes the scalar (inner) product.%
%
\footnote{ In \emph{optimal control} (OC) we want to reach some target
  state under some optimality conditions (mostly minimum time or
  minimum energy). We denote $\vec{r}$ as our target distribution and
  denote the time-to-arrival as $t^*$. If $t^*$ were known beforehand,
  then we could minimize the error
  \begin{equation}
    H = \frac{1}{2} ( \vec{r} - \vec{s}_{t^*} )^2
    = \frac{1}{2} ( \vec{r} - \vec{F}^{t^*} \vec{s}_0 )^2 .
  \end{equation}
  However, the exact arrival time is mostly not known beforehand, so the
  best we can do is to minimize the (time-weighted) expected error
  \begin{equation}
    H = \sum_{t^*=0}^\infty \gamma^{t^*}
    \frac{1}{2} ( \vec{r} - \vec{F}^{t^*} \vec{s}_0 )^2
    = \frac{1}{2} k \vec{r}^2
    - \langle \vec{r}, \vec{z} \rangle
    + \frac{1}{2} \sum_{t^*=0}^\infty \gamma^{t^*}
    \left( \vec{F}^{t^*} \vec{s}_0 \right)^2
    \label{eq:control-error}
  \end{equation}
  where $k = \sum_{t^*=0}^\infty \gamma^{t^*} = 1/(1-\gamma)$.
  The first term on the right hand side is often not relevant because it
  is independent of $\vec{F}$. When we compare
  Eq.~\ref{eq:control-error} with Eq.~\ref{eq:rl-error}, we see that the
  OC error function has a quadratic term in $\vec{F}$ which the RL error
  lacks.
}

206:

207: Because $\vec{z}$ is a solution of Eq.~\ref{eq:system} it is dependent

208: on $\vec{F}$ which in turn depends on policy $\vec{P}$. Given

209: $\vec{r}$ and $\vec{s}_0$, our task is to find the optimal $\vec{P}^*$

210: such that $H$ is maximized, i.e.

211: \begin{equation}

212:   \vec{P}^* = \arg \max_{\vec{P}} H .

213: \end{equation}

214:

We can regard the calculation of the future expected reward as a
composition of two operators:
\begin{equation}
  \mathcal{Q}: \vec{F} \mapsto \vec{z}
\end{equation}
which maps the induced transition matrix $\vec{F}$ to the expected
occupancy $\vec{z}$, and
\begin{equation}
  \mathcal{R}: \vec{z} \mapsto H
\end{equation}
which maps the occupancy $\vec{z}$, given a reward vector
$\vec{r}$, to the expected reward value $H$.

227:

228: %----------------------------------------------------------------------

229: \subsubsection*{Calculation of the policy gradient}

230: %----------------------------------------------------------------------

231: A variation $\delta \vec{z}$ in the expected occupancy can be related

232: to first order to a perturbation $\delta \vec{P}$ in the (stochastic)

233: policy. To obtain the partial derivatives $\partial \vec{z} / \partial

234: P_{ik}$, we differentiate Eq.~\ref{eq:system} with respect to $P_{ik}$

235: and obtain:

236: \begin{equation}

237:   - \gamma \frac{\partial \vec{F}}{\partial P_{ik}} \vec{z}

238:   + (\vec{I} - \gamma \vec{F})\frac{\partial \vec{z}}{\partial P_{ik}}

239:   =  0.

240: \label{eq:diff1}

241: \end{equation}

242: The right hand side of the equation is zero because we assume that

243: $\vec{s}_0$ is independent of $P_{ik}$.

244: Rearranging gives:

245: \begin{equation}

246:   \frac{\partial \vec{z} }{\partial P_{ik} } =

247:   \gamma \vec{K} \frac{\partial \vec{F}}{\partial P_{ik} } \vec{z}

248: \label{eq:diff2}

249: \end{equation}

250: where $\vec{K} = (\vec{I} - \gamma \vec{F})^{-1}$. %

251:

From Eq.~\ref{eq:rl-error} and Eq.~\ref{eq:diff2}, together with the
chain rule, we obtain the gradient of the expected reward $H$ with
respect to the policy parameters $P_{ik}$:

255: \begin{equation}

256:   \frac{ \partial H } { \partial P_{ik} } =

257:   \frac{ \partial H } { \partial \vec{z} }

258:   \frac{ \partial \vec{z} } { \partial P_{ik} }

259:   = \left\langle \vec{r}, \gamma \vec{K} \frac{\partial \vec{F}}

260:     {\partial P_{ik}} \vec{z} \right\rangle

261:   = \gamma \left\langle \vec{K}^*\vec{r}, \frac{\partial \vec{F}}

262:     {\partial P_{ik}} \vec{z} \right\rangle

263: \label{eq:gradient}

264: \end{equation}

where $A^*$ denotes the adjoint of operator $A$, defined by $\langle u, A
w \rangle = \langle A^*u, w \rangle$ (for a real matrix, the transpose).
Let us define $\vec{q}=\vec{K}^*\vec{r}$. While $\vec{K}$ maps the
initial state $\vec{s}_0$ to the future expected state occupancy
$\vec{z}$, its adjoint $\vec{K}^*$ maps the reward vector $\vec{r}$
back to the \emph{expected reward} $\vec{q}$. The value of $q_i$
represents the (pointwise) expected future reward when starting in
state $s_i$ and following policy $\vec{P}$.~%
\footnote{Indeed, this is a different way to define the traditional
  \emph{value function}. Note that, generally, neither $\vec{r}$ nor
  $\vec{q}$ is a probability vector because its 1-norm is generally
  not 1.
}
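
To make the role of $\vec{q}$ explicit, the following short derivation
(a sketch that uses only the definitions above, with real matrices so
that the adjoint is the transpose) shows that $\vec{q}$ solves a linear
system of the same form as Eq.~\ref{eq:system}:
\[
  \vec{q} = (\vec{I} - \gamma \vec{F}^T)^{-1} \vec{r}
  \quad \Longleftrightarrow \quad
  \vec{q} = \vec{r} + \gamma \vec{F}^T \vec{q} ,
\]
i.e. $q_i = r_i + \gamma \sum_j F_{ji} \, q_j$, which is the Bellman
equation of the value function under the induced transitions. The same
adjoint relation gives two equivalent expressions for the expected
reward:
\[
  H = \langle \vec{r}, \vec{K} \vec{s}_0 \rangle
    = \langle \vec{r}, \vec{z} \rangle
    = \langle \vec{K}^* \vec{r}, \vec{s}_0 \rangle
    = \langle \vec{q}, \vec{s}_0 \rangle .
\]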

277:

278: Finally, differentiating Eq.~\ref{eq:projection} gives us

279: $\partial \vec{F} / \partial P_{ik}$. Inserting this into

280: Eq.~\ref{eq:gradient} yields:

281: \begin{equation}

282:   G_{ik} = \frac{ \partial H } { \partial P_{ik} }

283:   \propto \, z_i \! \sum_j T_{kji} q_j.

284: \label{eq:gradient-p}

285: \end{equation}

In words, the gradient of $H$ with respect to policy parameter
$P_{ik}$ (i.e. the probability of taking action $a_k$ in state $s_i$)
is proportional to the expected occupancy $z_i$ times the sum of the
expected rewards $q_j$ of the next states $(j=1,...,N)$, weighted by
the transition probabilities $T_{kji}$.
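
The intermediate step can be written out as follows (a worked sketch;
the two-state chain used for the check is a hypothetical example).
Reading $P_{ik}$ as the probability of action $k$ in state $i$, only
column $i$ of $\vec{F}$ depends on $P_{ik}$, with $\partial F_{ji'} /
\partial P_{ik} = T_{kji}$ if $i'=i$ and zero otherwise. Hence
\[
  \left( \frac{\partial \vec{F}}{\partial P_{ik}} \vec{z} \right)_{\!j}
  = T_{kji} \, z_i
  \qquad \mbox{and} \qquad
  \frac{\partial H}{\partial P_{ik}}
  = \gamma \, z_i \sum_j T_{kji} \, q_j ,
\]
so the proportionality constant is the discount factor $\gamma$. As a
check, take two states with $\vec{s}_0 = (1,0)^T$ and $\vec{r} =
(0,1)^T$, where state $2$ is absorbing and in state $1$ the policy
stays (action $1$) with probability $1-p$ and moves to state $2$
(action $2$) with probability $p$. Then
\[
  z_1 = \frac{1}{1-\gamma+\gamma p}, \qquad
  q_1 = \frac{\gamma p}{(1-\gamma)(1-\gamma+\gamma p)}, \qquad
  q_2 = \frac{1}{1-\gamma},
\]
and $H = \langle \vec{q}, \vec{s}_0 \rangle = q_1$. The formula gives
$G_{11} = \gamma z_1 q_1$ and $G_{12} = \gamma z_1 q_2$, so that
\[
  \frac{\dd H}{\dd p} = G_{12} - G_{11}
  = \gamma \, z_1 (q_2 - q_1)
  = \frac{\gamma}{(1-\gamma+\gamma p)^2} ,
\]
which agrees with differentiating $H(p) = \gamma p /
[(1-\gamma)(1-\gamma+\gamma p)]$ directly.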

291:

Note that the gradient could also have been approximated using finite
differences, which would need up to $1+N^2$ field calculations (i.e.
solutions of the linear system in
Eq.~\ref{eq:system}).\footnote{ A finite difference approximation of the
  derivative $\partial \vec{z} / \partial F_{ij}$ involves computing
  $\vec{z}$ for $\vec{F}$, then perturbing a single element $F_{ij}$ of
  $\vec{F}$ by a tiny amount $\dd F$ and recomputing the occupancy
  $\vec{z}'$. The derivative is then approximated by $\partial \vec{z}
  / \partial F_{ij} \approx (\vec{z}'-\vec{z})/\dd F$. For an $N\times N$
  matrix $\vec{F}$, one would need to repeat this for every element,
  which requires a total of up to $1+N^2$ calculations of $\vec{z}$.}
The adjoint method is much more efficient and needs only \emph{two}
field calculations: one for $\vec{z}$ and one for $\vec{q}$.

304:

305: %\paragraph{Optimal control}

306: %The OC error has an additional quadratic term in $\vec{F}$

307: %\begin{equation}

308: %\frac{\partial} { \partial T_{ij} } =

309: %\sum_{t^*=0}^\infty t^* \gamma^{t^*} \vec{F}^{2t^*-1} \vec{s}_0

310: %\frac{\partial \vec{F}} { \partial T_{ij} }\vec{s}_0

311: %\end{equation}

312:

Once we have the gradient $\vec{G}$, improving policy $\vec{P}$ is
straightforward using gradient ascent; we can also use more
sophisticated gradient-based methods such as nonlinear conjugate
gradients (as in~\cite{baxter99direct}). The optimization is nonlinear
because $\vec{z}$ and $\vec{q}$ themselves depend on the current
estimate of $\vec{P}$.

319:

320: %========================================================================

321: \section{Computation of the optimal policy}

322: %========================================================================

323:

We introduce two algorithms that incorporate our ideas of
gradient-based reinforcement planning. The first is an offline
planning algorithm that finds the optimal policy but
assumes that the environment transition probabilities are known. The
second algorithm is an online version that could cope with unknown
environments.

330:

331: %----------------------------------------------------------------------

332: \subsection{Offline GREP}

333: %----------------------------------------------------------------------

334:

If the environment transition probabilities $T_{kji}$ are known, the
agent may improve its policy using GREP. Our offline GREP planning
algorithm consists of two steps:

\begin{enumerate}
\item {\em Plan ahead:} Compute the policy gradient $\vec{G}$ in
  Eq.~\ref{eq:gradient-p} and improve the current policy
  \begin{equation}
    \vec{P} \leftarrow \vec{P} + \alpha \vec{G}
    \label{eq:t-update}
  \end{equation}
  where $\alpha$ is a suitable step size parameter; for efficiency we
  can also perform a line search on $\alpha$.
\item {\em Evaluate policy:} Repeat the above until the policy is optimal.
\end{enumerate}
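
As an illustration of a single planning step, the sketch below applies
the update to a hypothetical two-state chain (not the maze of the
experiments). The step size $\alpha = 0.01$, the discount factor
$\gamma = 0.9$ and the explicit renormalization of the action
probabilities of each state are assumptions made for this example.
State $2$ is an absorbing goal with $\vec{r} = (0,1)^T$, the start is
$\vec{s}_0 = (1,0)^T$, and in state $1$ the policy stays (action $1$)
or moves to state $2$ (action $2$) with probability $0.5$ each. Solving
the two linear systems for $\vec{z}$ and $\vec{q}$ gives
\[
  z_1 = \frac{1}{0.55} \approx 1.82, \qquad
  q_1 = \frac{0.45}{0.1 \times 0.55} \approx 8.18, \qquad
  q_2 = \frac{1}{0.1} = 10 ,
\]
and, including the factor $\gamma$, the gradient entries for state $1$
are
\[
  G_{11} = \gamma \, z_1 q_1 \approx 13.39, \qquad
  G_{12} = \gamma \, z_1 q_2 \approx 16.36 .
\]
The update yields the unnormalized values $0.5 + 0.01 \times 13.39
\approx 0.634$ and $0.5 + 0.01 \times 16.36 \approx 0.664$. Dividing by
their sum ($\approx 1.298$) gives the new probabilities $\approx 0.489$
(stay) and $\approx 0.511$ (move), so probability mass shifts toward
the action that leads to the goal.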

350:

Matrix $\vec{P}$ describes a probabilistic policy. We define the
\emph{maximum probable policy} (MPP) to be the deterministic policy
obtained by taking the most probable action in each state. It is not
obvious that the MPP will converge to the globally optimal solution, but
we expect the MPP to be at least near-optimal.

356:

357: %========================================================================

358: \subsubsection*{Numerical experiments}

359: %========================================================================

360:

361: \begin{figure} \centering

362:   \includegraphics[width=0.147\textwidth]{maze.eps}

363: %  \includegraphics[width=0.155\textwidth]{z.epsi}

364: %  \includegraphics[width=0.155\textwidth]{q.epsi}

365:   \includegraphics[width=0.155\textwidth]{zt.eps}

366:   \includegraphics[width=0.155\textwidth]{qt.eps}

367: \caption{Left: $10\times10$ toy maze with start at left and goal at right side.

368:   Center: plot of expected occupancy $\vec{z}$. Right: plot of

369:   expected reward $\vec{q}$. White corresponds to higher probability.

370:   [Blurring is due to visualisation only]. }

371: \label{fig:maze}

372: \end{figure}

373:

374: \begin{figure*} \centering

375:   \includegraphics[width=0.45\textwidth]{pol1}\hfill

376:   \includegraphics[width=0.45\textwidth]{pol2}

377: \caption{%

378:   Plot of simulated path length versus GREP iteration of a small toy

379:   MDP maze for the probability weighted (PW) policy, annealed PW

380:   policy and MPP policy. The shortest path to goal is 14. Left:

381:   starting from initial uniform policy. Right: starting from initial

382:   random policy. }

383: \label{fig:reward}

384: \end{figure*}

385:

We performed numerical experiments using offline GREP. Our test
problem was a pure planning task in a $10\times10$ toy maze (see
Fig.~\ref{fig:maze}), where the probabilistic policy $\vec{P}$
represents the probability of taking a certain action at a certain
maze position. The same figure also shows typical solutions for the
quantities $\vec{z}$ and $\vec{q}$, i.e. the expected occupancy and
expected reward respectively (for a certain $\vec{P}$).

After each GREP iteration, i.e. after each gradient calculation and
$\vec{P}$ update, we checked the obtained policy by running 20
simulations using the current value of $\vec{P}$. The probability
weighted (PW) policy selects action $k$ in state $i$ with probability
proportional to $P_{ik}$, while the annealed PW policy uses an
annealing factor of $T=4$; we also simulated the MPP solution.
Figure~\ref{fig:reward} shows the average simulated path length versus
GREP iteration for the PW policy, the annealed PW policy and the
derived MPP. In the left plot the initial policy $\vec{P}$ was
uniform. The right plot shows the simulated path lengths when
starting from a random policy; here, too, the MPP finds the optimal
solution, but slightly later.

405:

We see from the figure that in both cases the probability-weighted
(PW) policy improves during the GREP iterations. However, the
convergence is very slow, which indicates the severe non-linearity of
the problem. The annealed PW policy performs better than the PW
policy. Finally, we see that the MPP finds the optimal solution within
a few iterations. Using Dijkstra's method, we confirmed that the MPP
found was in agreement with the globally shortest path.

413:

414: %----------------------------------------------------------------------

\subsection{Online GREP}

416: %----------------------------------------------------------------------

The account below describes an idea for using GREP when the environment
is not known beforehand.\footnote{At the time of writing, we have not
  yet implemented this idea.} The steps interleave ``Kalman
filter''-like estimation of the unknown environment transition
probabilities with the explicit planning of GREP. They also
include a step to estimate a possibly unknown (linear) sensor
mapping. Apart from the policy matrix $\vec{P}$, we also need to
estimate the (environment) transition probabilities $T_{kji}$ and
possibly the sensor matrix $\vec{B}$. We can optimize for all
parameters by iteratively ascending to their conditional modes. The
conditional maximization steps are easy:

428: %

429: \begin{enumerate}

\item {\em Plan ahead:} Compute the policy gradient $\vec{G}$ in
  Eq.~\ref{eq:gradient-p} and improve the current policy
  \begin{equation}
    \vec{P} \leftarrow \vec{P} + \alpha \vec{G}
    \label{eq:t-update-online}
  \end{equation}
  where $\alpha$ is a suitable step size parameter; for efficiency we
  can also perform a line search. Afterwards, we need to renormalize
  $\vec{P}$ so that the action probabilities of each state sum to one.
  See note below on policy regularization.

439:

\item {\em Select action:} Given state estimate $\vec{s}_t$, draw an
  action $k$ from the policy according to
  \[ k \sim \vec{P}_{\!t}^T \, \vec{s}_t . \]
  Then receive reward $R$ and estimate the new state $\vec{s}_{t+1}$
  as described in the next step.

444:

445: %%^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

446: %% IVO: leaving out stuff below because not directly relevant in

447: %%      THIS paper. Very interesting though; NEXT paper?!

448: %%^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

\item {\em Estimate state:} Observe $\vec{y}_{t+1}$ and estimate
  $\vec{s}_{t+1}$ using
  \begin{equation}
    p(\vec{s}_{t+1} | \vec{y}_{t+1}, \vec{s}_{t}, k )
    \; \propto \;
    p(\vec{y}_{t+1} | \vec{s}_{t+1}) \,
    p(\vec{s}_{t+1} | \vec{T}_k, \vec{s}_{t} ).
  \end{equation}
  Assuming Gaussian noise for observations and state estimates we
  obtain:
  \begin{equation}
    \vec{s}_{t+1} =
    \left( \vec{H}_s + \vec{B}^{-1} \vec{H}_y \vec{B}^{-T} \right)^{-1}
    \left( \vec{H}_s \vec{T}_k \vec{s}_{t} +
    \vec{B}^{-1} \vec{H}_y \vec{y}_{t+1} \right)
  \end{equation}
  where $\vec{B}$ is the \emph{sensor matrix} that maps internal state
  $\vec{s}$ to observations $\vec{y}$. Matrices $\vec{H}_s$ and
  $\vec{H}_y$ are the inverse covariances (the so-called
  \emph{precisions}) of state $\vec{s}$ and observation $\vec{y}$
  respectively.

470:

\item {\em Estimate sensors:} If the sensor matrix $\vec{B}$ is also
  unknown, we have to perform an additional estimation step for
  $\vec{B}$. This is a common step in the standard Kalman formulation.

474:

\item {\em Estimate environment:} Given action $k$ and reward $R$, we
  update the reward vector
  \begin{equation}
    r_j \leftarrow R
  \end{equation}
  where $j$ is the current state ($\vec{s}_t = j$), and the environment
  transition probabilities
  \begin{equation}
    \dd \vec{T}_k \;\propto\; (\vec{s}_{t+1} - \vec{T}_k \vec{s}_t )
    \vec{s}_t^T  (\vec{s}_t \vec{s}_t^T)^{-1},
  \end{equation}
  or, if $(\vec{s}_t \vec{s}_t^T)^{-1}$ does not exist, we can use
  \begin{equation}
    \dd \vec{T}_k \;\propto\; (\vec{s}_{t+1} - \vec{T}_k \vec{s}_t )
    (\vec{s}_t^T \vec{s}_t)^{-1} \vec{s}_t^T.
  \end{equation}
  We then re-estimate the environment transition probabilities as
  \begin{equation}
    \vec{T}_{k} \leftarrow
    \vec{T}_{k} + (\vec{s}_{t+1} - \vec{T}_k \vec{s}_t ) \vec{s}_t^T.
  \end{equation}
  After the update, one should set entries in $\vec{T}_k$ that
  correspond to physically impossible transitions to zero and
  then renormalize the columns of $\vec{T}_k$. It is
  important to note that, given $\vec{s}_{t}$ and $\vec{s}_{t+1}$,
  the transition matrix $\vec{T}_k$ is conditionally independent of the
  policy $\vec{P}$. That is to say, we can obtain an accurate model of
  the environment using, e.g., just a random walk.
\item Repeat from step 1.

504: \end{enumerate}
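
For a fully observed transition, the last update of the environment
model can be made concrete (a sketch; the learning rate $\eta$ below is
an assumption that does not appear in the update above). If the agent
moves from state $i$ to state $j$ under action $k$, the state vectors
are the unit vectors $\vec{s}_t = \vec{e}_i$ and $\vec{s}_{t+1} =
\vec{e}_j$, so that $\vec{s}_t^T \vec{s}_t = 1$ and the matrix
\[
  \vec{T}_k + (\vec{e}_j - \vec{T}_k \vec{e}_i) \, \vec{e}_i^T
\]
differs from $\vec{T}_k$ only in column $i$, which is replaced by
$\vec{e}_j$: the most recent observation overwrites the previous
estimate of $p(\cdot\,|s_i,a_k)$. A damped variant with $0 < \eta \le
1$,
\[
  \vec{T}_k \leftarrow \vec{T}_k
  + \eta \, (\vec{e}_j - \vec{T}_k \vec{e}_i) \, \vec{e}_i^T ,
\]
replaces column $i$ by the convex combination $(1-\eta)\, \vec{T}_k
\vec{e}_i + \eta \, \vec{e}_j$, which averages over repeated visits and
keeps the column normalized. For example, with $\eta = 0.1$, a column
$(0.5, 0.5)^T$ followed by an observed transition to state $1$ becomes
$(0.55, 0.45)^T$.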

505:

To summarize what is happening: in the planning stage, based
on the current (and possibly inaccurate) environment model, the agent
tries to improve its current policy by planning ahead using the
gradient in Eq.~\ref{eq:gradient-p}. Remember that the gradient
involves simulating paths from the current state and adjoint paths
from the goal. In the action stage the agent samples an action from
its policy. Then the agent senses the new state and updates its
environment model using this new information. Notice that policy
improvement is not done ``backwards'', as is traditionally done in DP
methods, but ``forward'' by planning ahead.

516:

517: %========================================================================

518: \section{Conclusions}

519: %========================================================================

520:

521: \subsubsection*{Future topics}

We have tacitly assumed that $\vec{z}$ and $\vec{q}$ are computed
using the same discount factor $\gamma$. However, we could introduce
separate parameters $\gamma_z$ and $\gamma_q$, which effectively
assign a ``forward time window'' to $\vec{z}$ and a
``backward time window'' to $\vec{q}$. In fact, when $\gamma_z
\rightarrow 0$ we have a ``one-step-look-ahead''. Alternatively, in
the limit of $\gamma_q \rightarrow 0$ we obtain a gradient for a
greedy policy that maximizes only ``immediate reward''. How both
parameters affect GREP's performance is a topic for future research.

531:

The above suggests that GREP can be viewed as a generalization of
``one-step-look-ahead'' policy improvement.  In fact, a
``one-step-look-ahead'' improvement rule can be obtained for
$\gamma_z \rightarrow 0$ simply by taking $\vec{z}=\vec{s}_t$ in
Eq.~\ref{eq:gradient-p}.  Such an approach would be ``policy greedy''
in the sense that it updates the policy only locally. We expect GREP to
perform better because it updates the policy more globally; whether
this in fact improves performance remains an issue for future
research.
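
Written out (a sketch of the suggested generalization, which we have
not tested), the two discount factors would enter as
\[
  \vec{z} = (\vec{I} - \gamma_z \vec{F})^{-1} \vec{s}_0 , \qquad
  \vec{q} = (\vec{I} - \gamma_q \vec{F}^T)^{-1} \vec{r} , \qquad
  G_{ik} \propto z_i \sum_j T_{kji} \, q_j .
\]
For $\gamma_z \rightarrow 0$ the occupancy reduces to the current state
distribution, so only the policy parameters of currently occupied
states are updated. For $\gamma_q \rightarrow 0$ we have $\vec{q} =
\vec{r}$ and $G_{ik} \propto z_i \sum_j T_{kji} \, r_j$, i.e. the
expected immediate reward of action $k$ in state $i$, weighted by the
occupancy $z_i$.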

541:

542: The interleaving of GREP with a Kalman-like estimation procedure of

543: the environment could handle a variety of interesting problems such as

544: planning in POMDP environments.

545:

We must mention that an appropriate reparameterization of the stochastic
policy, e.g. using a Boltzmann distribution, could improve the
convergence. We have not pursued this further.

549:

550: \subsubsection*{Summary}

551: We have introduced a learning method called ``gradient-based

552: reinforcement planning'' (GREP). GREP needs a model of the

553: environment and plans ahead to improve its policy \emph{before} it

554: actually acts in the environment.  We have derived formulas for the

555: exact policy gradient.

556:

557: Numerical experiments suggest that the probabilistic policy indeed

558: converges to an optimal policy---but quite slowly. We found that (at

559: least in our toy example) the optimal solution can be found much

560: faster by annealing or simply by taking the most probable action at

561: each state.

562:

Further work will be to incorporate GREP in online RL tasks
where the environment parameters, i.e. the transition probabilities
$T_{kji}$, are unknown and have to be learned. While an analytical
solution for $\vec{q}$ and $\vec{z}$ is only viable for small problem
sizes, for larger problems we probably need to investigate Monte Carlo
or DP methods.%

569:

570: %========================================================================

571: %\bibliographystyle{plain}

572: %\bibliographystyle{mlapa}

573: {\small

574: %\bibliography{grep}

575: \begin{thebibliography}{}

576:

577: \bibitem[Baird, 1998][Baird][1998]{baird98gradient}

578: Baird, L.~C. (1998).

579: \newblock Gradient descent for general reinforcement learning.

580: \newblock {\em Advances in Neural Information Processing Systems}.

581: \newblock {MIT} Press.

582:

583: \bibitem[Baxter \& Bartlett, 1999][Baxter and Bartlett][1999]{baxter99direct}

584: Baxter, J., \& Bartlett, P. (1999).

585: \newblock {\em Direct gradient-based reinforcement learning: I. gradient

586:   estimation algorithms} (Technical Report).

587: \newblock Research School of Information Sciences and Engineering, Australian

588:   National University.

589:

590: \bibitem[Difilippo et~al.\/, 1996][Difilippo et~al.\/][1996]{difilippo}

591: Difilippo, F.~C., Goldstein, M., Worley, B.~A., \& Ryman, J.~C. (1996).

592: \newblock Adjoint {M}onte {C}arlo methods for radiotherapy treatment planning.

593: \newblock {\em Trans. Am. Nucl. Soc.}, {\em 74}, 14--16.

594:

595: \bibitem[Ng et~al.\/, 1999][Ng et~al.\/][1999]{ng99policy}

596: Ng, A., Parr, R., \& Koller, D. (1999).

597: \newblock Policy search via density estimation.

598: \newblock {\em Advances in Neural Information Processing Systems}.

599: \newblock {MIT} Press.

600:

601: \bibitem[Schmidhuber, 1990][Schmidhuber][1990]{Schmidhuber:90sandiego}

602: Schmidhuber, J. (1990).

603: \newblock An on-line algorithm for dynamic reinforcement learning and planning

604:   in reactive environments.

605: \newblock {\em Proc. IEEE/INNS International Joint Conference on Neural

606:   Networks, San Diego} (pp.\/ 253--258).

607:

608: \bibitem[Sutton \& Barto, 1998][Sutton and Barto][1998]{suttonbarto}

609: Sutton, R.~S., \& Barto, A.~G. (1998).

610: \newblock {\em Reinforcement learning. {A}n introduction}.

611: \newblock {MIT} Press, Cambridge.

612:

613: \bibitem[Sutton et~al.\/, 2000][Sutton et~al.\/][2000]{sutton00policy}

614: Sutton, R.~S., McAllester, D., Singh, S., \& Mansour, Y. (2000).

615: \newblock Policy gradient methods for reinforcement learning with function

616:   approximation.

617: \newblock {\em Advances in Neural Information Processing Systems}.

618: \newblock {MIT} Press.

619:

620: \bibitem[Williams, 1992][Williams][1992]{williams92simple}

621: Williams, R.~J. (1992).

622: \newblock Simple statistical gradient-following algorithms for connectionist

623:   reinforcement learning.

624: \newblock {\em Machine Learning}, {\em 8}, 229--256.

625:

626: \end{thebibliography}

}


%========================================================================
%========================================================================
%========================================================================


%========================================================================
\section*{Appendix A: Implicit policies}
%========================================================================

In deterministic environments, where the state-action pair $(s_i,a_m)$
uniquely leads to a state $s_j$, i.e.\ $T_{kji} = \delta_{km} \;
(k=1,\ldots,K)$, the projection $\vec{F}$ is solely determined
by the policy $\vec{z}$, and \emph{vice versa}. We refer to this as
the case of an \emph{implicit policy}, because the policy is implied
by the induced transition probability $T_{ji}$.

In such environments it suffices to solve for $\vec{F}$ directly
and omit the parameterization through $\vec{z}$. From
Eq.~\ref{eq:projection} we see that
\begin{equation}
  P_{im} = T_{ji}
\end{equation}
and, using a derivation similar to the one for $\partial H /
\partial P_{ik}$, it can be shown that the gradient of $H$ with
respect to $\vec{F}$ is given by
\begin{equation}
  \vec{G} = \vec{r} \vec{z}^T.
\label{eq:rank-one}
\end{equation}
One important caveat applies. In most cases many elements
$T_{ji}$ are zero, representing an absent transition between $s_i$ and
$s_j$. Naively updating $\vec{F}$ using the full gradient $\vec{G}$
would cause complete fill-in of $\vec{F}$, which in most cases is
undesirable or even physically incorrect. Therefore, one must check the
gradient at each update and set the entries that correspond to
impossible transitions to zero. We will refer to this
``heuristically corrected'' gradient as
$\widetilde{\vec{G}}$. Also, after each update, the columns of
$\vec{F}$ have to be renormalized. The rank-one form of the gradient
in Eq.~\ref{eq:rank-one} is interesting because it provides an
efficient means of updating the inverse in Eq.~\ref{eq:occupancy}.
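
To make the corrected update concrete, one possible reading of the
procedure described above is sketched here. The step size $\alpha > 0$ and
the index set $\mathcal{A}$ of physically possible transitions $(j,i)$
are notation assumed for this illustration only.
\begin{eqnarray*}
  \widetilde{G}_{ji} &=& \left\{ \begin{array}{ll}
      G_{ji} & \mbox{if } (j,i) \in \mathcal{A}, \\
      0      & \mbox{otherwise,}
    \end{array} \right. \\
  F'_{ji} &=& F_{ji} + \alpha \, \widetilde{G}_{ji}, \\
  F^{\new}_{ji} &=& \frac{F'_{ji}}{\sum_{l} F'_{li}} .
\end{eqnarray*}
The last step is well defined only if all entries of $\vec{F}'$ stay
non-negative and every column sum stays positive, which presumably
requires a sufficiently small $\alpha$ or an additional clipping step;
neither is specified in the text.

For the uncorrected rank-one gradient, the efficiency remark can be
read as an application of the Sherman--Morrison identity,
$(\vec{A} + \vec{u}\vec{v}^T)^{-1} = \vec{A}^{-1} - \vec{A}^{-1}
\vec{u}\vec{v}^T \vec{A}^{-1} / (1 + \vec{v}^T \vec{A}^{-1} \vec{u})$,
which holds whenever the denominator is non-zero. Assuming that
Eq.~\ref{eq:occupancy} has the form $\vec{z} = \vec{B}^{-1} \vec{s}_0$
with $\vec{B} = \vec{I} - \gamma \vec{F}$, the step
$\vec{F}' = \vec{F} + \alpha \vec{r} \vec{z}^T$ gives
\begin{displaymath}
  \left( \vec{B} - \alpha\gamma \, \vec{r} \vec{z}^T \right)^{-1}
  = \vec{B}^{-1} + \frac{ \alpha\gamma \, \vec{B}^{-1} \vec{r}
    \vec{z}^T \vec{B}^{-1} }
  { 1 - \alpha\gamma \, \vec{z}^T \vec{B}^{-1} \vec{r} } .
\end{displaymath}
Once masking and renormalization are applied, the change in $\vec{F}$
is in general no longer rank one, so this shortcut covers only the
uncorrected step.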

%% [IVO: Hmm... not proven.... leave out??!!]
%%
%Then, writing $\vec{F}' = \vec{F} + \alpha
%\vec{r}\vec{z}^T$ and $\vec{B} = \vec{I} - \gamma \vec{F}$ we have:
%\begin{eqnarray}
%  \vec{z}'
%  &=& ( \vec{I} - \gamma \vec{F}' )^{-1} \vec{s}_0
%  \; = \; ( \vec{B} - \beta \vec{r} \vec{z}^T )^{-1} \vec{s}_0  \nonumber \\
%  &=& \vec{z} + \frac{ \vec{B}^{-1} \widetilde{\vec{G}} }
%  { \beta^{-1} - \mbox{tr}\,\vec{B}^{-1} \widetilde{\vec{G}} } \, \vec{z}
%\end{eqnarray}
%where we used the rank-one update formula for the inverse:
%\begin{equation}
%  \left(\vec{F} + \vec{u} \vec{v}^T \right)^{-1} =
%  \vec{F}^{-1} - \frac{ \vec{F}^{-1} \vec{u} \vec{v}^T \vec{F}^{-1} }
%  { 1 + \vec{v}^T \vec{F}^{-1} \vec{u} }
%\end{equation}
%provided that $1 + \vec{v}^T \vec{F}^{-1} \vec{u} \neq 0$.


%========================================================================
\section*{Appendix B: Monte Carlo gradient sampling}
%========================================================================

In our example, we calculated $\vec{z}$ and $\vec{q}$ in
Eq.~\ref{eq:occupancy} by linear programming. For large state spaces
the matrix inversion quickly becomes too computationally intensive, and
traditional dynamic-programming-based methods would probably be more
efficient.

Instead, we investigated the use of Monte Carlo (MC) simulation. We use
\emph{forward sampling} to approximate the expected state occupancies
in $\vec{z}$ and so-called \emph{adjoint Monte Carlo
  sampling}~\cite{difilippo} to estimate the adjoint reward $\vec{q}$.
Adjoint MC simulation is far more efficient than estimating each
$q_i$ by a separate MC run.%
\footnote{By one MC run we mean performing, e.g., 10000 trials from
  a fixed state.} %
By performing the simulation backward from $\vec{r}$, we obtain all
values of $\vec{q}$ using only a single MC run.
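
As a rough sketch of what the two estimators compute, assume that
Eq.~\ref{eq:occupancy} defines the occupancy and the adjoint reward
through $\vec{z} = (\vec{I} - \gamma \vec{F})^{-1} \vec{s}_0$ and
$\vec{q} = (\vec{I} - \gamma \vec{F}^T)^{-1} \vec{r}$ with
$0 \le \gamma < 1$ and column-stochastic $\vec{F}$ (the form of
$\vec{q}$ in particular is an assumption made here for illustration).
Expanding both inverses as Neumann series gives
\begin{displaymath}
  \vec{z} = \sum_{t \ge 0} \gamma^t \vec{F}^t \vec{s}_0 ,
  \qquad
  \vec{q} = \sum_{t \ge 0} \gamma^t \left( \vec{F}^T \right)^t \vec{r} .
\end{displaymath}
Forward sampling then estimates $z_i$ from $M$ simulated trajectories
whose initial state is drawn from $\vec{s}_0$ (taken to be a
probability distribution) and whose transitions follow $\vec{F}$:
\begin{displaymath}
  \hat{z}_i = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \ge 0}
  \gamma^t \, \delta_{i,\,s^{(m)}_t} ,
\end{displaymath}
where $s^{(m)}_t$ denotes the index of the state visited at step $t$ of
trajectory $m$; the symbols $M$, $m$ and $t$ are introduced here, and
in practice the inner sum is truncated at a finite horizon. The
backward simulation samples the second series in the same way, with
$\vec{F}^T$ in place of $\vec{F}$ and $\vec{r}$ in place of
$\vec{s}_0$, which is why a single run yields estimates of all $q_i$
at once. Since the rows of $\vec{F}$ need not sum to one, the backward
walk generally has to carry a weight; how this is handled is not
specified in the text.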

\begin{figure} \centering
  \includegraphics[height=0.28\textwidth]{f6} \hfill
  \includegraphics[height=0.28\textwidth]{f6a} \hfill
  \includegraphics[height=0.28\textwidth]{g6} \\
  \hspace{24mm} (a) \hfill (b) \hfill (c) \hspace{20mm}
\caption{ \it MC calculations of a $6\times6$ toy maze. The agent is
  at (0,1) and targets (5,4).  (a) Expected occupancy, (b) adjoint
  probability, and (c) normalized policy gradient for $n=4$, $n=40$,
  $n=200$. Each vector is computed as $\vec{v} = \sum_k P_{ik}
  \vec{e}_k$ where $\vec{e}_k$ is the unit vector along the state
  change induced by action $a_k$.}
\label{fig:mc-gradientplots}
\end{figure}

Fig.~\ref{fig:mc-gradientplots} shows the MC approximations of
$\vec{z}$ and $\vec{q}$. On the right of the same figure, we have
plotted the policy gradient computed from MC estimates using a
minimum number of $n=\{20, 40, 200\}$ samples. To compare them with
the exact gradient, we calculated the exact values of $\vec{z}$ and
$\vec{q}$ by inverting the linear system.  For larger numbers of
samples, the gradient vectors do indeed point more strongly towards the
goal.

An important feature of general Monte Carlo methods is that they
automatically concentrate their sampling on the important regions of
the parameter space, mostly in proportion to the posterior or the
likelihood. For our purpose of sampling the gradient, we have tried to
apply \emph{annealing} in order to concentrate the sampling density
even more towards the regions of large gradient values. To sample from a
density $p(\theta)$ we may sample from the annealed function
$p_\gamma(\theta) = p(\theta)^\gamma / \int p(\theta)^\gamma d\theta$
and reweight each sample with its importance weight
$1/p_\gamma(\theta)$. For $\gamma \rightarrow \infty$, the set of
samples concentrates on the most probable gradient.
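
The reweighting step can be spelled out with the standard
importance-sampling identity, written here for a generic integrand
$f(\theta)$ and $M$ samples (both symbols are introduced for this
illustration). If $\theta_1, \ldots, \theta_M$ are drawn from
$p_\gamma$, and $p_\gamma(\theta) > 0$ wherever $f(\theta) \neq 0$, then
\begin{displaymath}
  \int f(\theta) \, d\theta
  = \int \frac{f(\theta)}{p_\gamma(\theta)} \, p_\gamma(\theta) \, d\theta
  \approx \frac{1}{M} \sum_{m=1}^{M}
  \frac{f(\theta_m)}{p_\gamma(\theta_m)} ,
\end{displaymath}
which is where the weight $1/p_\gamma(\theta)$ comes from. For an
expectation under the original density $p$, the same argument gives
the weight $p(\theta_m)/p_\gamma(\theta_m)$ instead. Note that the
annealing exponent $\gamma$ in this paragraph can grow without bound
and is therefore a separate quantity from the factor $\gamma$ in
$\vec{I} - \gamma \vec{F}$.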

In conclusion, our approach of separately estimating $\vec{q}$ and
$\vec{z}$ using MC and \emph{then} multiplying their solutions
elementwise did not bring clear advantages. If we could sample
from the joint distribution $q_i z_i$ (i.e.\ the elementwise product),
then MC would clearly turn out to be a very efficient method.

\end{document}