1: \documentclass[10pt,twocolumn,letterpaper,twoside]{article} % a4paper
2: %%% CVS version control block - do not edit manually
3: %%% $RCSfile: p.tex,v $
4: %%% $Revision: 1.64.2.3 $
5: %%% $Date: 2005/11/25 15:56:58 $
6: %%% $Source: /home/cvs/papers/query-max/p.tex,v $
7:
8: \usepackage{comment}
9: \usepackage{amsmath}
10: \usepackage{amsfonts}
11: \usepackage[headings,in]{fullpage}
12: \usepackage[dvips]{graphicx}
13: \usepackage{psfrag}
14: \usepackage[nice]{nicefrac}
15: \usepackage{newcent} % bookman times
16: \usepackage[round]{natbib}
17: \usepackage{url}
18: \urlstyle{same}
19: %%\usepackage{fancyhdr}
20: %%\usepackage[today,short]{rcsinfo}
21: %%\rcsInfo $Id: p.tex,v 1.64.2.3 2005/11/25 15:56:58 bap Exp $
22:
23: \DeclareMathOperator{\var}{var}
24: \DeclareMathOperator{\stderr}{stderr}
25: \newcommand{\bigfrac}[2]{\frac{\displaystyle #1}{\displaystyle #2}}
26: \newcommand{\ifrac}[2]{({#1}/{#2})}
27: \newcommand{\bigsum}[0]{\displaystyle\sum}
28: \providecommand{\abs}[1]{\lvert#1\rvert}
29: \newcommand{\ie}[0]{\emph{i.e.}}
30: \newcommand{\eq}[1]{Eq.~\ref{eq:#1}}
31: \newcommand{\fig}[1]{Fig.~\ref{fig:#1}}
32: \DeclareMathOperator{\noise}{\mathcal{N}}
33:
34: %%% Allow figures on page/columns with just a little regular text
35: \renewcommand\topfraction{.99} %1
36: \renewcommand\bottomfraction{.99}%1
37: \renewcommand\textfraction{.01} %0
38: \renewcommand\floatpagefraction{.99}
39: \setcounter{totalnumber}{50}
40: \setcounter{topnumber}{50}
41: \setcounter{bottomnumber}{50}
42:
43: \newlength{\gwidth}
44: \setlength{\gwidth}{0.49\textwidth}
45:
46: \graphicspath{
47: {figures/}
48: {/home/barak/src/papers/query-max/figures/}
49: {figs/}
50: {figs/home/barak/src/papers/query-max/figures/}
51: {./} % for, eg, arXiv
52: }
53: \DeclareGraphicsExtensions{.eps,.eps.gz,.jpg,.gif,.png}
54: \DeclareGraphicsRule{.eps}{eps}{.eps}{}
55: \DeclareGraphicsRule{.eps.gz}{eps}{.eps.gz.bb}{}
56: %\DeclareGraphicsRule{.png}{eps}{.png.bb}{`convert #1 eps:-}
57: %\DeclareGraphicsRule{*}{eps}{.bb}{`convert #1 eps:-}
58:
59: \setlength{\parskip}{2ex}
60: \setlength{\parindent}{0ex}
61:
62: \title{Bounds on Query Convergence}
63:
64: \author{\textbf{Barak A. Pearlmutter}\thanks{Hamilton Institute, NUI
65: Maynooth, Co.\ Kildare, Ireland.}}
66:
67: \date{\today\\\small (CVS: \rcsInfoFile\ \rcsInfoRevision)}
68: \date{}
69:
70: \pagestyle{plain} % fancy
71: %%\fancyhf[LH]{\emph{Bounds on Query Convergence}}
72: %%\fancyhf[RH]{\emph{Pearlmutter}}
73:
74: \begin{document}
75: \maketitle
76: \thispagestyle{empty}
77:
78: \begin{abstract}
79: The problem of finding an optimum using noisy evaluations of a smooth
80: cost function arises in many contexts, including economics, business,
81: medicine, experiment design, and foraging theory. We derive an
82: asymptotic bound
83: \begin{math}
84: E[ (x_t-x^*)^2 ] \geq O(t^{-1/2})
85: \end{math}
86: on the rate of convergence of a sequence $(x_0, x_1, \ldots)$
87: generated by an unbiased feedback process observing noisy evaluations
88: of an unknown quadratic function maximised at $x^*$. The bound is
89: tight, as the proof leads to a simple algorithm which meets it. We
90: further establish a bound on the total regret,
91: \begin{math}
92: E\bigl[ \sum_{\tau=1}^{t} (x_{\tau} - x^*)^2 \bigr] \geq O(t^{1/2}) .
93: \end{math}
94: These bounds may impose practical limitations on an agent's
95: performance, as $O(\epsilon^{-4})$ queries are made before the queries
96: converge to $x^*$ with $\epsilon$ accuracy.
97: \end{abstract}
98:
99: \section{Introduction}
100:
101: Finding an input $x$ to a system so as to optimise some property
102: $f(x)$ of the system's output, using only noisy measurements, is a
103: ubiquitous problem. For instance, in medicine $x$ might be a drug
104: dosage and $f(x)$ the probability of a successful outcome; in business
105: $x$ might be the price set by a manufacturer and $f(x)$ the consequent
106: profit; in game theory $x$ might be a strategy and $f(x)$ its return;
107: and in evolutionary theory $x$ might be the brightness of a bird's
108: plumage and $f(x)$ the consequent reproductive success.
109:
110: When the measurements of $f(x)$ are noise-free this is a classical
111: optimisation problem, as studied by Gauss. Optimisation theory
112: remains to this day a productive branch of applied mathematics. In
113: general, the assumption is made that the function to be optimised
114: takes on a simplified form in the neighbourhood of its optimum---most
115: often, quadratic. The criterion by which we evaluate such algorithms
is typically the convergence rate of their estimates of the location of
117: the optimum, although the complexity of the algorithm itself can also
118: be a consideration.
119:
120: Here we consider a situation in which the measurements of the function
121: are assumed to be noisy. A similar situation in which noisy
122: measurements of the gradient are available is studied in stochastic
123: gradient optimisation \citep{ROBBINS-MONRO51a, LJUNG77,
124: WIDROW-MCCOOL-LARIMORE-JOHNSON79}. Here however we assume that
125: gradient information is not available. We further assume that we are
126: interested not in our \emph{estimate} of the optimum converging as
127: rapidly as possible, but rather in the \emph{queries themselves}
128: converging to the optimum as rapidly as possible. As a practical
129: matter, the convergence of the queries themselves is important when
130: the function $f(x)$ is a measure of consequence, and making a
131: measurement at $x$ has an actual expected cost of $f(x)$, as in
132: measuring the survival rate of a medical treatment or the return of an
133: economic decision.
134:
135: Gradient information would make this problem much easier. For
136: illustration, consider two closely related optimisation problems. In
137: each, an inaccurate rifle with unknown bias can be swivelled
138: horizontally, and we wish to swivel it so as to maximise the
139: probability of hitting a small target. Due to the inaccuracy of the
rifle and the small target size, we are unlikely to hit the target
141: even when the rifle is aimed optimally. In one situation, we know
142: after each shot whether the bullet went to the left or the right of
143: the target. In the other situation, we know only whether the bullet
144: hit the target. Knowing whether the bullet went to the right or the
145: left of the target corresponds to having an estimate of the gradient,
146: and allows rapid convergence to the correct position by simply making
147: successively smaller adjustments after each shot away from the side to
148: which the bullet missed. But without this gradient information, it is
149: difficult to know in which direction to adjust the aim in response to
150: a miss. In fact, a single miss in isolation does not seem of any help
151: in improving the aim. It is our goal here to precisely characterise
152: the difficulty of such situations.
153:
154: \section{Proof Sketch}
155:
156: We construct an inequality which establishes a lower bound on the rate
157: of convergence of the queries $x_t$ to the optimum $x^*$. The
158: inequality follows from the observation that if the queries $x_t$ are
159: more spread out, the estimate of the optimum $x^*$ will have less
160: uncertainty. This relationship, in which faster convergence of the
161: queries leads to slower convergence of the estimate of $x^*$, is
162: quantified using the statistical notion of the leverage of the data,
163: which limits the accuracy of an estimate of a slope. This gives a
164: lower bound on the speed with which the queries $x_t$ can converge to
165: $x^*$. Violation of the bound would imply a contradiction: that the
166: queries converge to the optimum faster than does the best estimate of
167: the optimum.
168:
169: \section{Detailed Derivation}
170:
171: We consider an unbiased feedback system which uses noisy measurements
172: to find the $x$ which maximises $f(x)$, where $f(x)$ is locally
173: quadratic about its maximum $x^*$. To simplify the derivation we will
174: assume that $f(x)$ is not merely locally but globally quadratic
175: \begin{equation}
176: f(x) = - a x^2 + b x + c = -a (x - x^*)^2 + f(x^*)
177: \end{equation}
that the quadratic coefficient $a>0$ is known, leaving unknown only the
linear and constant terms $b$ and $c$, and that each noisy
measurement of $f(x)$ is corrupted by zero-mean i.i.d.\ additive
noise of variance $\sigma^2$.
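
As a check on this parameterisation, completing the square makes the
relation between the coefficients and the optimum explicit. This is a
worked step added for illustration and uses only $a$, $b$, and $c$ as
defined above:
\begin{equation*}
  -a x^2 + b x + c
  = -a \Bigl( x - \frac{b}{2a} \Bigr)^2 + c + \frac{b^2}{4a} ,
\end{equation*}
so that $x^* = b/2a$ and $f(x^*) = c + b^2/4a$.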
182:
183: Let $x_0, x_1, \ldots$ be the sequence of points evaluated. We
184: establish the following bound:
185:
186: \newtheorem{theorem}{Theorem}
187: %\newtheorem{proof}{Proof}
188: \newtheorem{corollary}{Corollary}
189:
190: \begin{theorem} \label{theorem:main}
191: For sufficiently large $t$ and an unbiased feedback process that
192: calculates $x_t$ using information available prior to $t$,
193: \begin{equation} \label{eq:main_thm} \displaystyle
194: E[ (x_t - x^*)^2 ] \geq \frac{\sigma}{\sqrt{8} \, a} \, t^{-1/2}
195: \end{equation}
196: \end{theorem}
197:
198: \textbf{Proof:}
199: %
200: Since $a$ is known we can add $a x_t^2$ to the measurements and fit
201: $b$ and $c$ to the resulting noisy line. The variance of $\hat{b}_t$,
202: the best unbiased estimate of $b$ given measurements made prior to
203: time $t$, is limited by the Cram\'er-Rao bound which depends on the
204: level of measurement noise and the leverage about the sample mean
205: $\overline{x}_t = (x_0 + x_1 + \cdots + x_{t-1})/t$,
206: \begin{equation}
207: \var \hat{b}_t
208: = \bigfrac{ \sigma^2 }
209: { \sum_{\tau<t} (x_{\tau} - \overline{x}_t)^2 } .
210: \end{equation}
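
To make the reduction to a line fit explicit, the following sketch
treats the query locations as fixed; the symbols $y_{\tau}$,
$z_{\tau}$, and $\eta_{\tau}$ are notation assumed here for
illustration only. Writing the measurement at $x_{\tau}$ as
$y_{\tau} = f(x_{\tau}) + \eta_{\tau}$ with $E[\eta_{\tau}] = 0$ and
$\var \eta_{\tau} = \sigma^2$, the shifted measurements are linear in
$x_{\tau}$,
\begin{equation*}
  z_{\tau} = y_{\tau} + a x_{\tau}^2 = b x_{\tau} + c + \eta_{\tau} ,
\end{equation*}
and the ordinary least-squares slope
\begin{equation*}
  \hat{b}_t
  = \bigfrac{ \sum_{\tau<t} (x_{\tau} - \overline{x}_t) \, z_{\tau} }
            { \sum_{\tau<t} (x_{\tau} - \overline{x}_t)^2 }
\end{equation*}
has the variance stated above.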
211:
This leverage is bounded above by the leverage about any point; here we
213: choose $x^*$, the desired point of convergence,
214: \begin{equation}
215: \sum_{\tau<t} (x_{\tau} - \overline{x}_t)^2
216: \leq \sum_{\tau<t} (x_{\tau} - x^*)^2
217: \end{equation}
218: so
219: \begin{equation}
220: \var \hat{b}_t
221: \geq \bigfrac{\sigma^2}{ \sum_{\tau<t} (x_{\tau} - x^*)^2 }
222: \end{equation}
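
The leverage inequality above can be verified with a standard
decomposition, spelled out here as a worked step. For any point $u$
(a placeholder symbol introduced for this check),
\begin{equation*}
  \sum_{\tau<t} (x_{\tau} - u)^2
  = \sum_{\tau<t} (x_{\tau} - \overline{x}_t)^2
    + t \, (\overline{x}_t - u)^2 ,
\end{equation*}
because the cross term
$2 (\overline{x}_t - u) \sum_{\tau<t} (x_{\tau} - \overline{x}_t)$
vanishes. The last term is non-negative, so taking $u = x^*$ gives the
inequality, with equality only when $\overline{x}_t = x^*$.
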
223: Because $x^* = b/2a$ the variance of an estimate of $x^*$ is related to
224: the variance of an estimate of $b$,
225: \begin{equation}
226: \var \hat{x}^*_t = \frac{1}{4a^2} \var \hat{b}_t
227: \end{equation}
228: where $\hat{x}^*_t$ is the best unbiased estimate of $x^*$ given
229: measurements made prior to $t$. By definition $\hat{x}_t^*$ cannot be
230: a worse estimate of $x^*$ than is $x_t$, and we have already seen a
231: bound on the quality of the estimate $\hat{x}_t^*$, so
232: \begin{equation} \label{eq:two_sided}
233: E[ (x_t - x^*)^2 ]
234: \geq \var \hat{x}^*_t
235: \geq \bigfrac{\sigma^2}{4a^2 \sum_{\tau<t} (x_{\tau} - x^*)^2 }
236: \end{equation}
237: where the expectation $E[\cdot]$ is taken over realisations of the
238: measurement noise.
239:
We now assume\footnotemark\ that $x_t$ converges polynomially,
241: $E[(x_t - x^*)^2] = (k t^r)^2$, and substitute this above to find $r$
242: and $k$. The leverage about $x^*$ can be evaluated,
243: %
244: \footnotetext{If the fastest possible convergence bound were not of
245: this form then we would obtain a valid bound, but not a tight one.
246: However, we constructively show that the bound obtained is tight.}
247: \begin{equation} \label{eq:form}
248: E\Bigl[ \sum_{\tau<t} (x_{\tau} - x^*)^2 \Bigr]
249: = k^2 \sum_{\tau<t} \tau^{2r}
  \approx \frac{k^2}{1+2r} t^{1+2r}
251: \end{equation}
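
The closed form in \eq{form} can be read as an integral approximation
of the sum. As a worked step, assuming $r > -1/2$ and large $t$, and
dropping the first few terms, which do not affect the large-$t$
behaviour,
\begin{equation*}
  \sum_{\tau<t} \tau^{2r}
  \approx \int_0^t \tau^{2r} \, d\tau
  = \frac{t^{1+2r}}{1+2r} .
\end{equation*}
For instance, $r = -1/4$ gives $\sum_{\tau<t} \tau^{-1/2} \approx 2 \, t^{1/2}$.
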
252: \eq{form} can be substituted into the two-sided bound on
253: $\var\hat{x}^*_t$ in \eq{two_sided}, yielding
254: \begin{gather}
255: k^2 t^{2r}
256: = E[ (x_t - x^*)^2 ]
257: \geq \var \hat{x}^*_t
258: \geq \frac{\sigma^2 (1+2r)}{4 k^2 a^2} t^{-(1+2r)}
259: \nonumber\\
260: \intertext{or}
261: k^4 \geq \frac{\sigma^2 (1+2r)}{4a^2} t^{-(1+4r)}
262: \end{gather}
263: This can only be satisfied if the right hand side is bounded, which
264: implies that $r \geq -1/4$, and hence
265: \begin{equation}
266: E[(x_t - x^*)^2] \geq O(t^{-1/2})
267: \end{equation}
268: The most aggressive convergence is for $r=-1/4$, at which point
269: equality is achieved when $k^2 = \sigma/(\sqrt{8} \, a)$.
270: Substituting yields \eq{main_thm}.
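
For concreteness, the constant can be checked by direct substitution,
using only the quantities above. With $r = -1/4$ we have
$1 + 2r = 1/2$ and $1 + 4r = 0$, so the condition on $k$ reads
\begin{equation*}
  k^4 \geq \frac{\sigma^2}{4a^2} \cdot \frac{1}{2}
      = \frac{\sigma^2}{8a^2} ,
  \qquad \text{\ie,} \qquad
  k^2 \geq \frac{\sigma}{\sqrt{8} \, a} ,
\end{equation*}
and $E[(x_t - x^*)^2] = k^2 t^{2r} = k^2 t^{-1/2}$.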
271:
272: \begin{corollary}[Bound on Instantaneous Regret]
The instantaneous regret (loss incurred at time $t$ due to
ignorance) of an unbiased online optimiser is bounded below in
expectation by
276: \begin{equation}
277: E[f(x^*) - f(x_t)] \geq \frac{\sigma}{\sqrt{8}} t^{-1/2}
278: \end{equation}
279: \end{corollary}
280:
281: \textbf{Proof:} Note that $f(x^*) - f(x) = a (x - x^*)^2$ and
282: substitute into Theorem \ref{theorem:main}.
283:
284: \begin{corollary}[Bound on Total Regret]
285: The total regret prior to time $t$, defined by
286: \begin{math}
  R_t = \sum_{\tau<t} \bigl( f(x^*) - f(x_{\tau}) \bigr) ,
288: \end{math}
289: incurred by an unbiased feedback process is bounded below in
290: expectation by
291: \begin{equation}
292: E[R_t] \geq \frac{\sigma}{\sqrt{2}} t^{1/2}
293: \end{equation}
294: \end{corollary}
295:
296: \textbf{Proof:} Summation of the bound on instantaneous regret.
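
Spelled out as a worked step, using the same integral approximation of
the sum as in the main derivation (valid for large $t$),
\begin{equation*}
  E[R_t]
  = \sum_{\tau<t} E[f(x^*) - f(x_{\tau})]
  \geq \frac{\sigma}{\sqrt{8}} \sum_{\tau<t} \tau^{-1/2}
  \approx \frac{\sigma}{\sqrt{8}} \cdot 2 \, t^{1/2}
  = \frac{\sigma}{\sqrt{2}} \, t^{1/2} .
\end{equation*}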
297:
298: \textbf{Note:} The expected regret bound is independent of the
299: constant of curvature $a$, whose effect cancels itself out in the
300: analysis. This is necessarily the case, because we could define
301: $\tilde{f}(x) = f(100 \, x)$ and an attempt to optimise $\tilde{f}(x)$
302: should yield the same regret as an attempt to optimise $f(x)$, despite
303: their differing curvatures.
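
The cancellation can be traced explicitly for this example. In the
sketch below, $\tilde{a}$ and $\tilde{x}^*$ are notation introduced
here for the curvature and optimum of $\tilde{f}$:
\begin{equation*}
  \tilde{f}(x) = -a (100 \, x - x^*)^2 + f(x^*)
  = -10^4 a \Bigl( x - \frac{x^*}{100} \Bigr)^2 + f(x^*) ,
\end{equation*}
so $\tilde{a} = 10^4 a$ and $\tilde{x}^* = x^*/100$, while the
measurement noise $\sigma$ is unchanged. Theorem~\ref{theorem:main}
then bounds the squared query error by
$\sigma / (\sqrt{8} \, \tilde{a}) \, t^{-1/2}$, and multiplying by
$\tilde{a}$ recovers the same instantaneous regret bound
$(\sigma/\sqrt{8}) \, t^{-1/2}$.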
304:
305: \begin{theorem}[Optimal Algorithm] \label{thm:alg}
306: The stochastic algorithm
307: \begin{equation}
308: x_t = \hat{x}^*_t + \noise\bigl((\stderr \hat{x}^*_t)^p\bigr)
309: \end{equation}
310: is unbiased and with $p=2$ achieves $E[(x_t - x^*)^2] \sim
311: \ifrac{\sqrt{2} \, \sigma}{a} \, t^{-1/2}$ and $E[R_t] \sim \sigma
312: \sqrt{8 t\,}$, where $\noise(\varsigma^2)$ is zero-mean
313: $\varsigma^2$-variance i.i.d.\ noise and $\stderr \hat{x}^*_t$ is the
314: standard error of the unbiased estimator $\hat{x}^*_t$.
315: \end{theorem}
316:
317: \textbf{Proof:} The algorithm involves only unbiased estimates and is
318: therefore unbiased.
319:
320: The inequalities above become equalities when
321: \begin{equation}
322: x_t = \hat{x}^*_t + \noise\bigl(\sqrt{2} \, \sigma a \, t^{-1/2} \bigr)
323: \end{equation}
324: which has the same injected variance (up to absorbed constant factors)
325: as in the proposed algorithm.
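
As a consistency check on the two rates stated in
Theorem~\ref{thm:alg}, the regret rate follows from the squared-error
rate by summation. This worked step takes the stated squared-error
rate as given rather than rederiving it:
\begin{equation*}
  E[R_t]
  = a \sum_{\tau<t} E[(x_{\tau} - x^*)^2]
  \sim a \cdot \frac{\sqrt{2} \, \sigma}{a} \sum_{\tau<t} \tau^{-1/2}
  \approx \sqrt{2} \, \sigma \cdot 2 \, t^{1/2}
  = \sigma \sqrt{8 t\,} .
\end{equation*}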
326:
327: \textbf{Note:} The existence of this algorithm implies that the
328: earlier bounds are tight. Interestingly, the algorithm does not
329: require knowledge of $a$ or $\sigma$, which are used only in the
330: analysis. Due to the statistics of the situation, $\stderr
331: \hat{x}^*_t$ scales appropriately with $a$ and $\sigma$.
332:
333: \begin{figure*}[t!]
334: \psfrag{time}[c][c]{$t$}
335: \psfrag{R(t)}[c][c]{$R_t$}
336: \psfrag{ 0}[r][r]{0}
337: \psfrag{ 10000}[r][c]{$10^4$}
338: \psfrag{ 200}[r][r]{200}
339: \psfrag{ 500}[r][r]{500}
340: \psfrag{ 1000}[r][r]{1000}
341: \psfrag{no noise}[c][c]{Greedy: $x_t = \hat{x}^*_t$}
342: \psfrag{p=0.8}[c][c]{$x_t = \hat{x}^*_t + \noise((\stderr\hat{x}^*_t)^{0.8})$}
343: \psfrag{p=2}[c][c]{$x_t = \hat{x}^*_t + \noise((\stderr\hat{x}^*_t)^{2})$}
344: \psfrag{p=3.6}[c][c]{$x_t = \hat{x}^*_t + \noise((\stderr\hat{x}^*_t)^{3.6})$}
345: \includegraphics[width=\gwidth]{plot-3_6}\hfill%
346: \includegraphics[width=\gwidth]{plot-0}\\[2ex]
347: \includegraphics[width=\gwidth]{plot-2}\hfill%
348: \includegraphics[width=\gwidth]{plot-0_8}
349: \caption{Total regret as a function of time for 100 overlaid runs of
350: the algorithm of Theorem~\ref{thm:alg} (bottom left) which optimally
351: trades off exploration and exploitation; with $p=0.8$ for more query
352: noise (bottom right) resulting in less between-run variation but
353: more regret; with $p=3.6$ for less query noise (top left) resulting
354: in more between-run variation; and for the greedy strategy, zero
355: query noise (top right) in which runs rapidly converge to incorrect
356: estimates. All runs used $\sigma^2=a=1$, $b=c=0$, and were
357: initialised with two queries at $x = x^* \pm 1$.}
358: \label{fig:runs}
359: \end{figure*}
360:
361: \begin{figure*}[t!]
362: \centerline{\input{totals.itex}}
363: %\includegraphics[width=\columnwidth]{plot-totals}
364: \caption{Bar graph (log scale) of total regret after $10^6$ queries,
365: averaged over 100 runs, for the algorithm of Theorem~\ref{thm:alg}
366: with $\sigma=1$ and $a=1$. Bars shown for values of $p$ both above
367: and below the optimal $p=2$, and also for the greedy algorithm of
368: zero injected noise. Risers show sample standard deviations.}
369: \label{fig:totals}
370: \end{figure*}
371:
372: \section{Discussion}
373:
374: Although the above theorems all assume unbiased estimates, integration
375: of prior information would, assuming that the prior is smooth, only
376: change an initial transient response of the system, leaving the
377: asymptotic behaviour unchanged. The limits on regret would change by
only a small additive constant whose value would depend upon the
379: details of the prior.
380:
The above exploration/exploitation tradeoff and bound hold when the
measurements are noisy and the cost of an evaluation is the value of
the function being optimised. The result is robust, in that small
changes to the model (a cost function quadratic only in the
neighbourhood of the optimum, for instance) will not change its
character.
386:
387: However a related situation, finding the zero $x^*$ of a linear
388: function using noisy measurements where the expected loss of a
389: measurement $x_t$ is quadratic in $x_t - x^*$, has a surprisingly
390: different result. In this matching-shoulders lob-pass case formalised
391: by \citet{ABE-TAKEUCHI93A} based on the foraging theory question posed
392: by \citet{HERRNSTEIN90A}, a convergence rate of $E[(x_t - x^*)^2] =
393: O(t^{-1})$ and thus an expected regret of $E[R_t] = O(\log t)$ can be
394: achieved \citep{KILIAN-ETAL94A, HIRAOKA-AMARI98A,
395: TAKEUCHI-ETAL-2000a}. This is because the measurements in that
396: setting serve the purpose of gradient information.
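
The step from the convergence rate to the regret in that setting is
again a summation, now of a harmonic series. As a worked sketch, let
$C$ be a placeholder constant (introduced here) absorbing the
convergence constant and the constant of proportionality of the
quadratic loss, so that the expected loss at step $\tau \geq 1$ is at
most $C \tau^{-1}$. Then
\begin{equation*}
  E[R_t] \leq \sum_{\tau=1}^{t} C \, \tau^{-1}
  \leq C \, (1 + \ln t) = O(\log t) .
\end{equation*}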
397:
398: Procedures which do not insert sufficient variability into their
399: queries acquire only finite leverage, resulting (with probability one)
400: in convergence to a non-optimum. This is seen in the upper
401: simulations of \fig{runs}. The minimal total regret in \fig{totals}
is for an algorithm injecting slightly less query noise than $\stderr
403: \hat{x}^*_t$. This is due to the slight additional leverage caused by
404: fluctuation of the estimate $\hat{x}^*_t$ over time.
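
The finite-leverage claim at the start of this paragraph can be made
concrete with the variance formula from the derivation. As a sketch,
suppose the leverage about the sample mean stays below some finite
$L$ (a placeholder symbol introduced here) for all $t$. Then
\begin{equation*}
  \var \hat{x}^*_t
  = \frac{1}{4a^2} \,
    \bigfrac{\sigma^2}{\sum_{\tau<t} (x_{\tau} - \overline{x}_t)^2}
  \geq \frac{\sigma^2}{4 a^2 L} > 0 ,
\end{equation*}
so the variance of the estimate cannot shrink to zero, and queries
that track the estimate settle at a point other than $x^*$.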
405:
Some procedures used in practice for problems of this character appear
407: to attempt to exceed the convergence bound established here, for
408: instance in medical treatment optimisation. The above bounds should
409: serve as a caution concerning the ease with which a seemingly
410: reasonable optimisation procedure can converge to a non-optimum. In
411: the setting considered here, when insufficient query variance is used
412: convergence to a non-optimum occurs, and standard statistical analysis
413: of the ongoing measurements will fail to give any hint of a problem.
414: Query variability must be injected when the setting itself requires
415: it, rather than only in response to empirical signs of premature
416: convergence.
417:
418: In business, the best selling price (which is not subject to the above
419: constraint, as noisy \emph{gradient} information is available) should
420: be faster to estimate than the supply or demand curves, which seem
421: potentially subject to this bound. This would argue that firms that
422: set their prices by first estimating supply and demand curves may be
423: at a disadvantage against those that set prices directly. More
424: speculatively, regulatory regimes have surprising variability
425: considering that all are designed to further similar goals. Legal
426: systems have similar diversity. The ultimate cause of this
427: variability may be the intrinsic difficulty of gradient-free noisy
428: query optimisation. Even more speculatively, sexual selection for
429: adaptive traits may provide a proxy for gradient information, thus
430: speeding evolution.
431:
432: \subsection*{Acknowledgements}
433:
434: Supported by Science Foundation Ireland grant 00/PI.1/C067. Thanks to
435: Tony Zador, Ken Duffy, and Susanna Still for helpful comments.
436:
437: \renewcommand{\bibsection}[0]{\subsection*{References}}
438: \setlength{\bibsep}{1ex}
439: \setlength{\bibhang}{0.75em}
440: \bibliographystyle{abbrvnat} % apalike plainnat unsrtnat
441: \bibliography{abb-abbr,boltzmann}
442:
443: \end{document}
444:
445: %%% Local Variables:
446: %%% tex-command: "TEXINPUTS=:figures latex"
447: %%% tex-bibtex-command: "BIBINPUTS=../bib bibtex -terse"
448: %%% ispell-local-dictionary: "british"
449: %%% End:
450:
451: % LocalWords: variational tradeoff
452: