0511:cs0511088/p.tex

1: \documentclass[10pt,twocolumn,letterpaper,twoside]{article} % a4paper

2: %%% CVS version control block - do not edit manually

3: %%%  $RCSfile: p.tex,v $

4: %%%  $Revision: 1.64.2.3 $

5: %%%  $Date: 2005/11/25 15:56:58 $

6: %%%  $Source: /home/cvs/papers/query-max/p.tex,v $

7:

8: \usepackage{comment}

9: \usepackage{amsmath}

10: \usepackage{amsfonts}

11: \usepackage[headings,in]{fullpage}

12: \usepackage[dvips]{graphicx}

13: \usepackage{psfrag}

14: \usepackage[nice]{nicefrac}

15: \usepackage{newcent}		% bookman times

16: \usepackage[round]{natbib}

17: \usepackage{url}

18: \urlstyle{same}

19: %%\usepackage{fancyhdr}

20: %%\usepackage[today,short]{rcsinfo}

21: %%\rcsInfo $Id: p.tex,v 1.64.2.3 2005/11/25 15:56:58 bap Exp $

22:

23: \DeclareMathOperator{\var}{var}

24: \DeclareMathOperator{\stderr}{stderr}

25: \newcommand{\bigfrac}[2]{\frac{\displaystyle #1}{\displaystyle #2}}

26: \newcommand{\ifrac}[2]{({#1}/{#2})}

27: \newcommand{\bigsum}[0]{\displaystyle\sum}

28: \providecommand{\abs}[1]{\lvert#1\rvert}

29: \newcommand{\ie}[0]{\emph{i.e.}}

30: \newcommand{\eq}[1]{Eq.~\ref{eq:#1}}

31: \newcommand{\fig}[1]{Fig.~\ref{fig:#1}}

32: \DeclareMathOperator{\noise}{\mathcal{N}}

33:

34: %%% Allow figures on page/columns with just a little regular text

35: \renewcommand\topfraction{.99}	%1

36: \renewcommand\bottomfraction{.99}%1

37: \renewcommand\textfraction{.01}	%0

38: \renewcommand\floatpagefraction{.99}

39: \setcounter{totalnumber}{50}

40: \setcounter{topnumber}{50}

41: \setcounter{bottomnumber}{50}

42:

43: \newlength{\gwidth}

44: \setlength{\gwidth}{0.49\textwidth}

45:

46: \graphicspath{

47:   {figures/}

48:   {/home/barak/src/papers/query-max/figures/}

49:   {figs/}

50:   {figs/home/barak/src/papers/query-max/figures/}

51:   {./}				% for, eg, arXiv

52: }

53: \DeclareGraphicsExtensions{.eps,.eps.gz,.jpg,.gif,.png}

54: \DeclareGraphicsRule{.eps}{eps}{.eps}{}

55: \DeclareGraphicsRule{.eps.gz}{eps}{.eps.gz.bb}{}

56: %\DeclareGraphicsRule{.png}{eps}{.png.bb}{`convert #1 eps:-}

57: %\DeclareGraphicsRule{*}{eps}{.bb}{`convert #1 eps:-}

58:

59: \setlength{\parskip}{2ex}

60: \setlength{\parindent}{0ex}

61:

62: \title{Bounds on Query Convergence}

63:

64: \author{\textbf{Barak A. Pearlmutter}\thanks{Hamilton Institute, NUI

65: Maynooth, Co.\ Kildare, Ireland.}}

66:

67: \date{\today\\\small (CVS: \rcsInfoFile\ \rcsInfoRevision)}

68: \date{}

69:

70: \pagestyle{plain} 		% fancy

71: %%\fancyhf[LH]{\emph{Bounds on Query Convergence}}

72: %%\fancyhf[RH]{\emph{Pearlmutter}}

73:

74: \begin{document}

75: \maketitle

76: \thispagestyle{empty}

77:

78: \begin{abstract}

79: The problem of finding an optimum using noisy evaluations of a smooth

80: cost function arises in many contexts, including economics, business,

81: medicine, experiment design, and foraging theory.  We derive an

82: asymptotic bound

83: \begin{math}

84:  E[ (x_t-x^*)^2 ] \geq O(t^{-1/2})

85: \end{math}

86: on the rate of convergence of a sequence $(x_0, x_1, \ldots)$

87: generated by an unbiased feedback process observing noisy evaluations

88: of an unknown quadratic function maximised at $x^*$.  The bound is

89: tight, as the proof leads to a simple algorithm which meets it.  We

90: further establish a bound on the total regret,

91: \begin{math}

92:  E\bigl[ \sum_{\tau=1}^{t} (x_{\tau} - x^*)^2 \bigr] \geq O(t^{1/2}) .

93: \end{math}

94: These bounds may impose practical limitations on an agent's

95: performance, as $O(\epsilon^{-4})$ queries are made before the queries

96: converge to $x^*$ with $\epsilon$ accuracy.

97: \end{abstract}

98:

99: \section{Introduction}

100:

101: Finding an input $x$ to a system so as to optimise some property

102: $f(x)$ of the system's output, using only noisy measurements, is a

103: ubiquitous problem.  For instance, in medicine $x$ might be a drug

104: dosage and $f(x)$ the probability of a successful outcome; in business

105: $x$ might be the price set by a manufacturer and $f(x)$ the consequent

106: profit; in game theory $x$ might be a strategy and $f(x)$ its return;

107: and in evolutionary theory $x$ might be the brightness of a bird's

108: plumage and $f(x)$ the consequent reproductive success.

109:

110: When the measurements of $f(x)$ are noise-free this is a classical

111: optimisation problem, as studied by Gauss.  Optimisation theory

112: remains to this day a productive branch of applied mathematics.  In

113: general, the assumption is made that the function to be optimised

114: takes on a simplified form in the neighbourhood of its optimum---most

115: often, quadratic.  The criterion by which we evaluate such algorithms

116: is typically the convergence rate of their estimate of the location of

117: the optimum, although the complexity of the algorithm itself can also

118: be a consideration.

119:

120: Here we consider a situation in which the measurements of the function

121: are assumed to be noisy.  A similar situation in which noisy

122: measurements of the gradient are available is studied in stochastic

123: gradient optimisation \citep{ROBBINS-MONRO51a, LJUNG77,

124: WIDROW-MCCOOL-LARIMORE-JOHNSON79}.  Here, however, we assume that

125: gradient information is not available.  We further assume that we are

126: interested not in our \emph{estimate} of the optimum converging as

127: rapidly as possible, but rather in the \emph{queries themselves}

128: converging to the optimum as rapidly as possible.  As a practical

129: matter, the convergence of the queries themselves is important when

130: the function $f(x)$ is a measure of consequence, and making a

131: measurement at $x$ has an actual expected cost of $f(x)$, as in

132: measuring the survival rate of a medical treatment or the return of an

133: economic decision.

134:

135: Gradient information would make this problem much easier.  For

136: illustration, consider two closely related optimisation problems.  In

137: each, an inaccurate rifle with unknown bias can be swivelled

138: horizontally, and we wish to swivel it so as to maximise the

139: probability of hitting a small target.  Due to the inaccuracy of the

140: riffle and the small target size, we are unlikely to hit the target

141: even when the rifle is aimed optimally.  In one situation, we know

142: after each shot whether the bullet went to the left or the right of

143: the target.  In the other situation, we know only whether the bullet

144: hit the target.  Knowing whether the bullet went to the right or the

145: left of the target corresponds to having an estimate of the gradient,

146: and allows rapid convergence to the correct position by simply making

147: successively smaller adjustments after each shot away from the side to

148: which the bullet missed.  But without this gradient information, it is

149: difficult to know in which direction to adjust the aim in response to

150: a miss.  In fact, a single miss in isolation does not seem to be of any help

151: in improving the aim.  It is our goal here to precisely characterise

152: the difficulty of such situations.

153:

154: \section{Proof Sketch}

155:

156: We construct an inequality which establishes a lower bound on the rate

157: of convergence of the queries $x_t$ to the optimum $x^*$.  The

158: inequality follows from the observation that if the queries $x_t$ are

159: more spread out, the estimate of the optimum $x^*$ will have less

160: uncertainty.  This relationship, in which faster convergence of the

161: queries leads to slower convergence of the estimate of $x^*$, is

162: quantified using the statistical notion of the leverage of the data,

163: which limits the accuracy of an estimate of a slope.  This gives a

164: lower bound on the speed with which the queries $x_t$ can converge to

165: $x^*$.  Violation of the bound would imply a contradiction: that the

166: queries converge to the optimum faster than does the best estimate of

167: the optimum.

168:

169: \section{Detailed Derivation}

170:

171: We consider an unbiased feedback system which uses noisy measurements

172: to find the $x$ which maximises $f(x)$, where $f(x)$ is locally

173: quadratic about its maximum $x^*$.  To simplify the derivation we will

174: assume that $f(x)$ is not merely locally but globally quadratic

175: \begin{equation}

176:  f(x) = - a x^2 + b x + c = -a (x - x^*)^2 + f(x^*) ,

177: \end{equation}

178: that the quadratic coefficient $a>0$ is known, leaving unknown only the

179: linear and constant terms $b$ and $c$, and that each noisy

180: measurements of $f(x)$ is corrupted by zero-mean i.i.d.\ additive

181: noise of variance $\sigma^2$.

182:

183: Let $x_0, x_1, \ldots$ be the sequence of points evaluated.  We

184: establish the following bound:

185:

186: \newtheorem{theorem}{Theorem}

187: %\newtheorem{proof}{Proof}

188: \newtheorem{corollary}{Corollary}

189:

190: \begin{theorem} \label{theorem:main}

191: For sufficiently large $t$ and an unbiased feedback process that

192: calculates $x_t$ using information available prior to $t$,

193: \begin{equation} \label{eq:main_thm} \displaystyle

194:   E[ (x_t - x^*)^2 ] \geq \frac{\sigma}{\sqrt{8} \, a} \, t^{-1/2}

195: \end{equation}

196: \end{theorem}

197:

198: \textbf{Proof:}

199: %

200: Since $a$ is known we can add $a x_t^2$ to the measurements and fit

201: $b$ and $c$ to the resulting noisy line.  The variance of $\hat{b}_t$,

202: the best unbiased estimate of $b$ given measurements made prior to

203: time $t$, is limited by the Cram\'er-Rao bound which depends on the

204: level of measurement noise and the leverage about the sample mean

205: $\overline{x}_t = (x_0 + x_1 + \cdots + x_{t-1})/t$,

206: \begin{equation}

207:   \var \hat{b}_t

208: 	= \bigfrac{ \sigma^2 }

209: 		  { \sum_{\tau<t} (x_{\tau} - \overline{x}_t)^2 } .

210: \end{equation}

211:

212: This leverage is bounded above by the leverage about any point; here we

213: choose $x^*$, the desired point of convergence,

214: \begin{equation}

215:   \sum_{\tau<t} (x_{\tau} - \overline{x}_t)^2

216: 	\leq \sum_{\tau<t} (x_{\tau} - x^*)^2

217: \end{equation}

218: so

219: \begin{equation}

220:   \var \hat{b}_t

221: 	\geq \bigfrac{\sigma^2}{ \sum_{\tau<t} (x_{\tau} - x^*)^2 }

222: \end{equation}

223: Because $x^* = b/2a$ the variance of an estimate of $x^*$ is related to

224: the variance of an estimate of $b$,

225: \begin{equation}

226:   \var \hat{x}^*_t = \frac{1}{4a^2} \var \hat{b}_t

227: \end{equation}

228: where $\hat{x}^*_t$ is the best unbiased estimate of $x^*$ given

229: measurements made prior to $t$.  By definition $\hat{x}_t^*$ cannot be

230: a worse estimate of $x^*$ than is $x_t$, and we have already seen a

231: bound on the quality of the estimate $\hat{x}_t^*$, so

232: \begin{equation} \label{eq:two_sided}

233:   E[ (x_t - x^*)^2 ]

234: 	\geq \var \hat{x}^*_t

235: 	\geq \bigfrac{\sigma^2}{4a^2 \sum_{\tau<t} (x_{\tau} - x^*)^2 }

236: \end{equation}

237: where the expectation $E[\cdot]$ is taken over realisations of the

238: measurement noise.
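The leverage inequality used above can be checked directly.  As a
sketch in the notation of this section (assuming only that
$\overline{x}_t$ is the mean of the $t$ queries $x_0, \ldots, x_{t-1}$),
expanding the squares about the sample mean gives
\begin{equation}
  \sum_{\tau<t} (x_{\tau} - x^*)^2
	= \sum_{\tau<t} (x_{\tau} - \overline{x}_t)^2
	+ t \, (\overline{x}_t - x^*)^2 ,
\end{equation}
because the cross term
$2 (\overline{x}_t - x^*) \sum_{\tau<t} (x_{\tau} - \overline{x}_t)$
vanishes.  The final term is non-negative, so the leverage about the
sample mean never exceeds the leverage about $x^*$, with equality only
when $\overline{x}_t = x^*$.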

239:

240: We now assume\footnotemark\ that $x_t$ convergences polynomially,

241: $E[(x_t - x^*)^2] = (k t^r)^2$, and substitute this above to find $r$

242: and $k$.  The leverage about $x^*$ can be evaluated,

243: %

244: \footnotetext{If the fastest possible convergence bound were not of

245: this form then we would obtain a valid bound, but not a tight one.

246: However, we constructively show that the bound obtained is tight.}

247: \begin{equation} \label{eq:form}

248:   E\Bigl[ \sum_{\tau<t} (x_{\tau} - x^*)^2 \Bigr]

249: 	= k^2 \sum_{\tau<t} \tau^{2r}

250: 	\approx \frac{k^2}{1+2r} t^{1+2r}

251: \end{equation}

252: \eq{form} can be substituted into the two-sided bound on

253: $\var\hat{x}^*_t$ in \eq{two_sided}, yielding

254: \begin{gather}

255:      k^2 t^{2r}

256:      = E[ (x_t - x^*)^2 ]

257:      \geq \var \hat{x}^*_t

258:      \geq \frac{\sigma^2 (1+2r)}{4 k^2 a^2} t^{-(1+2r)}

259: \nonumber\\

260: \intertext{or}

261:   k^4 \geq \frac{\sigma^2 (1+2r)}{4a^2} t^{-(1+4r)}

262: \end{gather}

263: This can only be satisfied if the right hand side is bounded, which

264: implies that $r \geq -1/4$, and hence

265: \begin{equation}

266:   E[(x_t - x^*)^2] \geq O(t^{-1/2})

267: \end{equation}

268: The most aggressive convergence is for $r=-1/4$, at which point

269: equality is achieved when $k^2 = \sigma/(\sqrt{8} \, a)$.

270: Substituting yields \eq{main_thm}.
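As a check on the constants, the following sketch spells out the
substitution $r=-1/4$, using only quantities defined above.  In that
case $1+2r = 1/2$ and $t^{-(1+4r)} = 1$, so the inequality on $k^4$
reads
\begin{equation}
  k^4 \geq \frac{\sigma^2}{8 a^2}
  \qquad\text{so}\qquad
  k^2 \geq \frac{\sigma}{\sqrt{8} \, a} .
\end{equation}
Requiring a root-mean-square query error of at most $\epsilon$, \ie\
$E[(x_t - x^*)^2] \leq \epsilon^2$, then forces
\begin{equation}
  \frac{\sigma}{\sqrt{8} \, a} \, t^{-1/2} \leq \epsilon^2
  \qquad\text{or}\qquad
  t \geq \frac{\sigma^2}{8 a^2 \epsilon^4} ,
\end{equation}
which is the $O(\epsilon^{-4})$ query count quoted in the abstract.
For illustration only (these numbers are not taken from the
simulations), $\sigma = a = 1$ and $\epsilon = 0.1$ give $t \geq 1250$.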

271:

272: \begin{corollary}[Bound on Instantaneous Regret]

273: The instantaneous regret (loss incurred at time $t$ due to

274: ignorance) of an unbiased online optimiser is bounded below in

275: expectation by

276: \begin{equation}

277:  E[f(x^*) - f(x_t)] \geq \frac{\sigma}{\sqrt{8}} t^{-1/2}

278: \end{equation}

279: \end{corollary}

280:

281: \textbf{Proof:} Note that $f(x^*) - f(x) = a (x - x^*)^2$ and

282: substitute into Theorem \ref{theorem:main}.

283:

284: \begin{corollary}[Bound on Total Regret]

285: The total regret prior to time $t$, defined by

286: \begin{math}

287:  R_t = \sum_{\tau<t} \bigl( f(x^*) - f(x_{\tau}) \bigr) ,

288: \end{math}

289: incurred by an unbiased feedback process is bounded below in

290: expectation by

291: \begin{equation}

292:  E[R_t] \geq  \frac{\sigma}{\sqrt{2}} t^{1/2}

293: \end{equation}

294: \end{corollary}

295:

296: \textbf{Proof:} Summation of the bound on instantaneous regret.
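The summation can be sketched as follows, approximating the sum by an
integral.  This approximation is an assumption of the sketch and is
accurate only for large $t$; the $\tau=0$ term is omitted because the
instantaneous bound applies only for sufficiently large $t$.
\begin{equation}
  E[R_t]
	\geq \frac{\sigma}{\sqrt{8}} \sum_{0<\tau<t} \tau^{-1/2}
	\approx \frac{\sigma}{\sqrt{8}} \int_0^t \tau^{-1/2} \, d\tau
	= \frac{\sigma}{\sqrt{8}} \, 2 \, t^{1/2}
	= \frac{\sigma}{\sqrt{2}} \, t^{1/2} .
\end{equation}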

297:

298: \textbf{Note:} The expected regret bound is independent of the

299: constant of curvature $a$, whose effect cancels itself out in the

300: analysis.  This is necessarily the case, because we could define

301: $\tilde{f}(x) = f(100 \, x)$ and an attempt to optimise $\tilde{f}(x)$

302: should yield the same regret as an attempt to optimise $f(x)$, despite

303: their differing curvatures.
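To make the cancellation concrete, here is a sketch for a general
rescaling factor $s$.  The symbol $s$ and the quantity $\tilde{a}$ are
introduced only for this illustration; the text above uses $s=100$.
With $\tilde{f}(x) = f(s \, x)$,
\begin{equation}
  \tilde{f}(x) = - a s^2 \Bigl( x - \frac{x^*}{s} \Bigr)^2 + f(x^*) ,
\end{equation}
so the curvature becomes $\tilde{a} = s^2 a$ while the noise level
$\sigma$ is unchanged.  Theorem~\ref{theorem:main} applied to
$\tilde{f}$ bounds the squared query error below by
$\sigma / (\sqrt{8} \, \tilde{a}) \, t^{-1/2}$, and multiplying by
$\tilde{a}$ to convert to regret gives
\begin{equation}
  \tilde{a} \cdot \frac{\sigma}{\sqrt{8} \, \tilde{a}} \, t^{-1/2}
	= \frac{\sigma}{\sqrt{8}} \, t^{-1/2} ,
\end{equation}
which does not depend on $s$.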

304:

305: \begin{theorem}[Optimal Algorithm] \label{thm:alg}

306: The stochastic algorithm

307: \begin{equation}

308:  x_t = \hat{x}^*_t + \noise\bigl((\stderr \hat{x}^*_t)^p\bigr)

309: \end{equation}

310: is unbiased and with $p=2$ achieves $E[(x_t - x^*)^2] \sim

311: \ifrac{\sqrt{2} \, \sigma}{a} \, t^{-1/2}$ and $E[R_t] \sim \sigma

312: \sqrt{8 t\,}$, where $\noise(\varsigma^2)$ is zero-mean

313: $\varsigma^2$-variance i.i.d.\ noise and $\stderr \hat{x}^*_t$ is the

314: standard error of the unbiased estimator $\hat{x}^*_t$.

315: \end{theorem}

316:

317: \textbf{Proof:} The algorithm involves only unbiased estimates and is

318: therefore unbiased.

319:

320: The inequalities above become equalities when

321: \begin{equation}

322:  x_t = \hat{x}^*_t + \noise\bigl(\sqrt{2} \, \sigma a \, t^{-1/2} \bigr)

323: \end{equation}

324: which has the same injected variance (up to absorbed constant factors)

325: as in the proposed algorithm.

326:

327: \textbf{Note:} The existence of this algorithm implies that the

328: earlier bounds are tight.  Interestingly, the algorithm does not

329: require knowledge of $a$ or $\sigma$, which are used only in the

330: analysis.  Due to the statistics of the situation, $\stderr

331: \hat{x}^*_t$ scales appropriately with $a$ and $\sigma$.

332:

333: \begin{figure*}[t!]

334: \psfrag{time}[c][c]{$t$}

335: \psfrag{R(t)}[c][c]{$R_t$}

336: \psfrag{ 0}[r][r]{0}

337: \psfrag{ 10000}[r][c]{$10^4$}

338: \psfrag{ 200}[r][r]{200}

339: \psfrag{ 500}[r][r]{500}

340: \psfrag{ 1000}[r][r]{1000}

341: \psfrag{no noise}[c][c]{Greedy: $x_t = \hat{x}^*_t$}

342: \psfrag{p=0.8}[c][c]{$x_t = \hat{x}^*_t + \noise((\stderr\hat{x}^*_t)^{0.8})$}

343: \psfrag{p=2}[c][c]{$x_t = \hat{x}^*_t + \noise((\stderr\hat{x}^*_t)^{2})$}

344: \psfrag{p=3.6}[c][c]{$x_t = \hat{x}^*_t + \noise((\stderr\hat{x}^*_t)^{3.6})$}

345: \includegraphics[width=\gwidth]{plot-3_6}\hfill%

346: \includegraphics[width=\gwidth]{plot-0}\\[2ex]

347: \includegraphics[width=\gwidth]{plot-2}\hfill%

348: \includegraphics[width=\gwidth]{plot-0_8}

349: \caption{Total regret as a function of time for 100 overlaid runs of

350:   the algorithm of Theorem~\ref{thm:alg} (bottom left) which optimally

351:   trades off exploration and exploitation; with $p=0.8$ for more query

352:   noise (bottom right) resulting in less between-run variation but

353:   more regret; with $p=3.6$ for less query noise (top left) resulting

354:   in more between-run variation; and for the greedy strategy, zero

355:   query noise (top right) in which runs rapidly converge to incorrect

356:   estimates.  All runs used $\sigma^2=a=1$, $b=c=0$, and were

357:   initialised with two queries at $x = x^* \pm 1$.}

358: \label{fig:runs}

359: \end{figure*}

360:

361: \begin{figure*}[t!]

362: \centerline{\input{totals.itex}}

363: %\includegraphics[width=\columnwidth]{plot-totals}

364: \caption{Bar graph (log scale) of total regret after $10^6$ queries,

365:   averaged over 100 runs, for the algorithm of Theorem~\ref{thm:alg}

366:   with $\sigma=1$ and $a=1$.  Bars shown for values of $p$ both above

367:   and below the optimal $p=2$, and also for the greedy algorithm of

368:   zero injected noise.  Risers show sample standard deviations.}

369: \label{fig:totals}

370: \end{figure*}

371:

372: \section{Discussion}

373:

374: Although the above theorems all assume unbiased estimates, integration

375: of prior information would, assuming that the prior is smooth, only

376: change an initial transient response of the system, leaving the

377: asymptotic behaviour unchanged.  The limits on regret would change by

378: only a small additive constant whose value would depend upon the

379: details of the prior.

380:

381: The above exploration/exploitation tradeoff and bound hold when using

382: noisy measurements and the cost of an evaluation is the value of the

383: function being optimised.  The result is robust, in that small changes

384: to the model (a cost function quadratic only in the neighbourhood of

385: the optimum, for instance) will not change its character.

386:

387: However, a related situation, finding the zero $x^*$ of a linear

388: function using noisy measurements where the expected loss of a

389: measurement $x_t$ is quadratic in $x_t - x^*$, has a surprisingly

390: different result.  In this matching-shoulders lob-pass case formalised

391: by \citet{ABE-TAKEUCHI93A} based on the foraging theory question posed

392: by \citet{HERRNSTEIN90A}, a convergence rate of $E[(x_t - x^*)^2] =

393: O(t^{-1})$ and thus an expected regret of $E[R_t] = O(\log t)$ can be

394: achieved \citep{KILIAN-ETAL94A, HIRAOKA-AMARI98A,

395: TAKEUCHI-ETAL-2000a}.  This is because the measurements in that

396: setting serve the purpose of gradient information.
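The gap between the two settings can be sketched by summing the
respective rates.  The constants $c$ and $c'$ below are placeholders
for this illustration, not values from the cited work.
\begin{equation}
  \sum_{\tau=1}^{t} c \, \tau^{-1} \approx c \ln t
  \qquad\text{versus}\qquad
  \sum_{\tau=1}^{t} c' \, \tau^{-1/2} \approx 2 c' \, t^{1/2} ,
\end{equation}
so a per-query squared error of $O(t^{-1})$ accumulates only
logarithmic regret, whereas the $O(t^{-1/2})$ rate of the
gradient-free setting accumulates regret growing as $t^{1/2}$.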

397:

398: Procedures which do not insert sufficient variability into their

399: queries acquire only finite leverage, resulting (with probability one)

400: in convergence to a non-optimum.  This is seen in the upper

401: simulations of \fig{runs}.  The minimal total regret in \fig{totals}

402: is for an algorithm injecting slightly less query noise than $\stderr

403: \hat{x}^*_t$.  This is due to the slight additional leverage caused by

404: fluctuation of the estimate $\hat{x}^*_t$ over time.

405:

406: Some procedures used in practice for problems of this character appear

407: to attempt to exceed the convergence bound established here, for

408: instance in medical treatment optimisation.  The above bounds should

409: serve as a caution concerning the ease with which a seemingly

410: reasonable optimisation procedure can converge to a non-optimum.  In

411: the setting considered here, when insufficient query variance is used

412: convergence to a non-optimum occurs, and standard statistical analysis

413: of the ongoing measurements will fail to give any hint of a problem.

414: Query variability must be injected when the setting itself requires

415: it, rather than only in response to empirical signs of premature

416: convergence.

417:

418: In business, the best selling price (which is not subject to the above

419: constraint, as noisy \emph{gradient} information is available) should

420: be faster to estimate than the supply or demand curves, which seem

421: potentially subject to this bound.  This would argue that firms that

422: set their prices by first estimating supply and demand curves may be

423: at a disadvantage against those that set prices directly.  More

424: speculatively, regulatory regimes have surprising variability

425: considering that all are designed to further similar goals.  Legal

426: systems have similar diversity.  The ultimate cause of this

427: variability may be the intrinsic difficulty of gradient-free noisy

428: query optimisation.  Even more speculatively, sexual selection for

429: adaptive traits may provide a proxy for gradient information, thus

430: speeding evolution.

431:

432: \subsection*{Acknowledgements}

433:

434: Supported by Science Foundation Ireland grant 00/PI.1/C067.  Thanks to

435: Tony Zador, Ken Duffy, and Susanna Still for helpful comments.

436:

437: \renewcommand{\bibsection}[0]{\subsection*{References}}

438: \setlength{\bibsep}{1ex}

439: \setlength{\bibhang}{0.75em}

440: \bibliographystyle{abbrvnat}    % apalike plainnat unsrtnat

441: \bibliography{abb-abbr,boltzmann}

442:

443: \end{document}

444:

445: %%% Local Variables:

446: %%% tex-command: "TEXINPUTS=:figures latex"

447: %%% tex-bibtex-command: "BIBINPUTS=../bib bibtex -terse"

448: %%% ispell-local-dictionary: "british"

449: %%% End:

450:

451: % LocalWords: variational tradeoff

452: