1: 
2: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3: % Adaptive Online Prediction by Following the Perturbed Leader%
4: %%      Marcus Hutter & Jan Poland: Start: December 2003     %%
5: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
6: 
7: \documentclass[12pt,twoside]{article}
8: \usepackage{latexsym}
9: \topmargin=-1cm  \oddsidemargin=5mm \evensidemargin=5mm
10: \textwidth=15cm \textheight=22cm \unitlength=1mm
11: \sloppy\lineskip=0pt
12: 
13: %-------------------------------%
14: %   Macro-Definitions           %
15: %-------------------------------%
16: \def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}
17: \def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip=  4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
18: \def\textmuskip{\thinmuskip= 0mu                    \medmuskip=  1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
19: %\def\dispmuskip{}\def\textmuskip{}    %normal math-spacing
20: \textmuskip
21: \def\beq{\dispmuskip\begin{equation}}    \def\eeq{\end{equation}\textmuskip}
22: \def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
23: \def\bqa{\dispmuskip\begin{eqnarray}}    \def\eqa{\end{eqnarray}\textmuskip}
24: \def\bqan{\dispmuskip\begin{eqnarray*}}  \def\eqan{\end{eqnarray*}\textmuskip}
25: \newtheorem{theorem}{Theorem}
26: \newtheorem{corollary}[theorem]{Corollary}
27: \newtheorem{lemma}[theorem]{Lemma}
28: \newtheorem{definition}[theorem]{Definition}
29: \newenvironment{keywords}{\centerline{\bf\small
30: Keywords}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
31: \def\citet{\cite}\def\citep{\cite}\def\citealt{\cite}\def\citeauthor{\cite}
32: \def\myparskip{\vspace{1.5ex plus 0.5ex minus 0.5ex}\noindent}
33: \def\paragraph#1{\myparskip{\bfseries\boldmath{#1.}}}
34: \def\paradot#1{\myparskip{\bfseries\boldmath{#1.}}}
35: \def\paranodot#1{\myparskip{\bfseries\boldmath{#1}}}
36: \def\eps{\varepsilon}
37: \def\nq{\hspace{-1em}}
38: \def\qed{\hspace*{\fill}$\Box\quad$\\}
39: \def\odt{{\textstyle{1\over 2}}}
40: \def\v{\boldsymbol}
41: \def\p{{\scriptscriptstyle+}}
42: \def\n{{n}}
43: \def\t{\pi}
44: \def\pin{{\scriptstyle\Pi}}
45: \def\Var{{\mbox{Var}}}
46: \def\Cov{{\mbox{Cov}}}
47: \def\SetR{I\!\!R}
48: \def\SetN{I\!\!N}
49: \def\N{{\cal N}}
50: \def\D{{\cal D}}
51: \def\S{{\cal S}}
52: \def\E{{\cal E}}
53: \def\X{{\cal X}}                        % input/perception set/alphabet
54: \def\Y{{\cal Y}}                        % output/action set/alphabet
55: \def\qmbox#1{{\quad\mbox{#1}\quad}}
56: \def\scp{{\scriptscriptstyle^{\,\circ}}}
57: \def\sooe{{\textstyle{1\over\eta}}}
58: \def\FPL{\text{FPL} }
59: \def\IFPL{\text{IFPL} }
60: 
61: \def\leqt{_{1:t}}
62: \def\leqtj{_{1:{t_j}}}
63: \def\leqtjj{_{1:{t_{j-1}}}}
64: \def\leqT{_{1:T}}
65: \def\ltT{_{<T}}
66: \def\ltt{_{<t}}
67: \def\lttj{_{<{t_j}}}
68: \def\lttjj{_{<{t_{j-1}}}}
69: \def\leqn{_{1:n}}
70: \def\leqss{_{1:s}}
71: \def\smin{^{min}}
72: 
73: \def\leqs#1{\stackrel {#1} \leq}
74: \def\text#1{\mbox{\scriptsize{#1}}}
75: \def\e{{\rm e}}                        % natural e
76: 
77: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
78: %                      T i t l e - P a g e                      %
79: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
80: 
81: \begin{document}
82: \title{\vskip -10mm\normalsize\sc Technical Report \hfill IDSIA-10-05
83: \vskip 2mm\bf\Large\hrule height5pt \vskip 6mm
84: Adaptive Online Prediction by \\ Following the Perturbed Leader
85: \vskip 6mm \hrule height2pt \vskip 5mm}
86: \author{{\bf Marcus Hutter} and {\bf Jan Poland}\\[3mm]
87: \normalsize IDSIA, Galleria 2, CH-6928\ Manno-Lugano, Switzerland%
88: \thanks{This work was supported by SNF grant 2100-67712.02.\newline\hspace*{3.6ex}
89: A shorter version appeared in the proceedings of the ALT 2004 conference \citep{Hutter:04expert}.}\\
90: \normalsize \{marcus,jan\}@idsia.ch, \ http://www.idsia.ch/$^{_{_\sim}}\!$\{marcus,jan\} }
91: \date{14 April 2005}
92: \maketitle
93: 
94: \begin{abstract}%
95: When applying aggregating strategies to Prediction with Expert
96: Advice, the learning rate must be adaptively tuned. The
97: natural choice of $\sqrt{\mbox{complexity/current loss}}$
98: renders the analysis of Weighted Majority derivatives quite
99: complicated. In particular, for arbitrary weights there have
100: been no results proven so far. The analysis of the alternative
101: ``Follow the Perturbed Leader'' (FPL) algorithm from Kalai and
102: Vempala (2003) based on Hannan's algorithm is easier. We
103: derive loss bounds for adaptive learning rate and both finite
104: expert classes with uniform weights and countable expert
105: classes with arbitrary weights. For the former setup, our loss
106: bounds match the best known results so far, while for the
107: latter our results are new.
108: \end{abstract}
109: 
110: \begin{keywords}
111: Prediction with Expert Advice,
112: Follow the Perturbed Leader,
113: general weights,
114: adaptive learning rate,
115: adaptive adversary,
116: hierarchy of experts,
117: expected and high probability bounds,
118: general alphabet and loss,
119: online sequential prediction.
120: \end{keywords}
121: 
122: \newpage
123: %------------------------------%
124: %      Table of Contents       %
125: %------------------------------%
126: \begin{quote}\begin{quote}
127: \def\contentsname{\normalsize \hfil Contents \hfil}
128: {\parskip=-2.5ex\tableofcontents}
129: \end{quote}\end{quote}
130: 
131: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
132: \section{Introduction}\label{secInt}
133: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
134: 
135: %-------------------------------%
136: %\paradot{Prediction with Expert Advice}
137: %-------------------------------%
138: In Prediction with Expert Advice (PEA) one considers an
139: ensemble of sequential predictors (experts). A master
140: algorithm is constructed based on the historical performance
141: of the predictors. The goal of the master algorithm is to
142: perform nearly as well as the best expert in the class, on any
143: sequence of outcomes. This is achieved by making (randomized)
144: predictions close to the better experts.
145: 
146: %-------------------------------%
147: %\paradot{Historical Survey}
148: %-------------------------------%
149: PEA theory has rapidly developed in the recent past.
150: Starting with the Weighted Majority (WM) algorithm of
151: \citet{Littlestone:89,Littlestone:94} and the aggregating
152: strategy of \citet{Vovk:90}, a vast variety of different
153: algorithms and variants have been published. A key parameter
154: in all these algorithms is the \emph{learning rate}. While
155: this parameter had to be fixed in the early algorithms such as
156: WM, \citet{Cesa:97} established the so-called doubling trick
157: to make the learning rate coarsely adaptive. A little later,
158: incrementally adaptive algorithms were developed by
159: \citet{Auer:00,Auer:02pea,Yaroshinsky:04,Gentile:03}, and
160: others. In Section \ref{secConc}, we will compare our results
161: with these works in more detail. Unfortunately, the loss bound
162: proofs for the incrementally adaptive WM variants are quite
163: complex and technical, despite the typically simple and
164: elegant proofs for a static learning rate.
165: 
166: %-------------------------------%
167: %\paradot{Adaptive Learning Rate}
168: %-------------------------------%
169: The growing complexity of the proof techniques also had another consequence:
170: While for the original WM algorithm, assertions are proven for
171: countable classes of experts with arbitrary weights, the modern
172: variants usually restrict to finite classes with uniform weights
173: (an exception being \citet{Gentile:03}, see the discussion
174: section). This might be sufficient for many practical purposes but
175: it prevents the application to more general classes of predictors.
176: Examples are extrapolating (=predicting) data points with the help
177: of a polynomial (=expert) of degree $d=1,2,3,...$ --or-- the (from
178: a computational point of view largest) class of all computable
179: predictors. Furthermore, most authors have concentrated on
180: predicting \emph{binary} sequences, often with the 0/1 loss for
181: $\{0,1\}$-valued and the absolute loss for $[0,1]$-valued
182: predictions. Arbitrary losses are less common. Nevertheless, it is
183: easy to abstract completely from the predictions and consider the
184: resulting losses only. Instead of predicting according to a
185: ``weighted majority'' in each time step, one chooses one
186: \emph{single} expert with a probability depending on his past
187: cumulated loss. This is done e.g.\ by \citet{Freund:97}, where an
188: elegant WM variant, the Hedge algorithm, is analyzed.
189: 
190: %-------------------------------%
191: %\paradot{Follow the Perturbed Leader}
192: %-------------------------------%
193: A different, general approach to achieve similar results is
194: ``Follow the Perturbed Leader'' (FPL). The principle dates
195: back to as early as 1957, now called Hannan's algorithm
196: \citep{Hannan:57}. In 2003, Kalai and Vempala published a
197: simpler proof of the main result of Hannan and also succeeded
198: in improving the bound by modifying the distribution of the
199: perturbation\nocite{Kalai:03}. The resulting algorithm (which
200: they call FPL*) has the same performance guarantees as the
201: WM-type algorithms for fixed learning rate, save for a factor
202: of $\sqrt 2$. A major advantage we will discover in this work
203: is that its analysis remains easy for an adaptive learning
204: rate, in contrast to the WM derivatives. Moreover, it
205: generalizes to online decision problems other than PEA.
206: 
207: %-------------------------------%
208: %\paradot{What's new}
209: %-------------------------------%
210: In this work, we study the FPL algorithm for PEA. The problems of
211: WM algorithms mentioned above are addressed: Bounds on the
212: cumulative regret of the standard form $\sqrt{kL}$ (where $k$ is
213: the complexity and $L$ is the cumulative loss of the best expert
214: in hindsight) are shown for countable expert classes with
215: arbitrary weights, adaptive learning rate, and arbitrary losses.
216: Regarding the adaptive learning rate, we obtain proofs that are
217: simpler and more elegant than for the corresponding WM algorithms.
218: (In particular, the proof for a self-confident choice of the
219: learning rate, Theorem~\ref{thFPLLDynamic}, is less than half a
220: page.) Further, we prove the first loss bounds for \emph{arbitrary
221: weights} and adaptive learning rate. In order to obtain the
222: optimal $\sqrt{kL}$ bound in this case, we will need to introduce
223: a hierarchical version of FPL, while without hierarchy we show a
224: worse bound $k\sqrt{L}$. (For self-confident learning rate
225: together with uniform weights and arbitrary losses, one can prove
226: corresponding results for a variant of WM by adapting an argument
227: by \citealt{Auer:02pea}.)
228: 
229: %-------------------------------%
230: %\paradot{Online, worst case and probabilities}
231: %-------------------------------%
232: PEA usually refers to an \emph{online worst case} setting: $n$
233: experts that deliver sequential predictions over a time range
234: $t=1,\ldots,T$ are given. At each time $t$, we know the actual
235: predictions and the \emph{past} losses. The goal is to give a
236: prediction such that the overall loss after $T$ steps is ``not
237: much worse'' than the best expert's loss \emph{on any sequence of
238: outcomes}. If the prediction is deterministic, then an adversary
239: could choose a sequence which provokes maximal loss. So we have to
240: \emph{randomize} our predictions. Consequently, we ask for a
241: prediction strategy such that the \emph{expected} loss on any
242: sequence is small.
243: 
244: %-------------------------------%
245: %\paradot{Contents}
246: %-------------------------------%
247: This paper is structured as follows. In Section~\ref{secSetup}
248: we give the basic definitions. While \citeauthor{Kalai:03}
249: consider general online decision problems in
250: finite-dimensional spaces, we focus on online prediction tasks
251: based on a countable number of experts. Like \citet{Kalai:03}
252: we exploit the infeasible FPL predictor (IFPL) in our
253: analysis.
254: %
255: Sections~\ref{secIFPL} and \ref{secFFPL} derive the main
256: analysis tools. In Section~\ref{secIFPL} we generalize (and
257: marginally improve) the upper bound \citep[Lem.3]{Kalai:03} on
258: IFPL to arbitrary weights. The main difficulty we faced was to
259: appropriately distribute the weights to the various terms. For
260: the corresponding lower bound (Section~\ref{secFFPL}) this
261: is an open problem.
262: %
263: In Section~\ref{secFFPL} we exploit our restricted setup to
264: significantly improve \citep[Eq.(3)]{Kalai:03} allowing for
265: bounds logarithmic rather than linear in the number of
266: experts.
267: %
268: The upper and lower bounds on IFPL are combined to derive
269: various regret bounds on FPL in Section~\ref{secBounds}.
270: Bounds for static and dynamic learning rate in terms of the
271: sequence length follow straightforwardly. The proof of our
272: main bound in terms of the loss is much more elegant than the
273: analysis of previous comparable results.
274: %
275: Section~\ref{secHierarchy} proposes a novel hierarchical procedure
276: to improve the bounds for non-uniform weights.
277: %
278: In Section~\ref{secLowFPL}, a lower bound is established.
279: %
280: In Section~\ref{secAdap}, we consider the case of independent
281: randomization more seriously. In particular, we show that the
282: derived bounds also hold for an adaptive adversary.
283: %
284: Section~\ref{secMisc} treats some additional issues, including
285: bounds with high probability, computational aspects, deterministic
286: predictors, and the absolute loss.
287: %
288: Finally, in Section~\ref{secConc} we discuss our results, compare
289: them to references, and state some open problems.
290: 
291: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
292: \section{Setup and Notation}\label{secSetup}
293: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
294: 
295: %-------------------------------%
296: \paradot{Setup}
297: %-------------------------------%
298: Prediction with Expert Advice proceeds as follows. We are asked to
299: perform sequential predictions $y_t\in\Y$ at times $t=1,2,\ldots$.
300: At each time step $t$, we have access to the predictions
301: $(y_t^i)_{1\leq i\leq n}$ of $n$ experts $\{e_1,...,e_n\}$, where
302: the size of the expert pool is $n\in\SetN\cup\{\infty\}$. It is
303: convenient to use the same notation for finite ($n\in\SetN$) and
304: countably infinite ($n=\infty$) expert pools. After having made a
305: prediction, we make some observation $x_t\in\X$, and a loss is
306: revealed for our and each expert's prediction. (E.g.\ the loss
307: might be 1 if the expert made an erroneous prediction and 0
308: otherwise. This is the 0/1 loss.) Our goal is to achieve a total
309: loss ``not much worse'' than that of the best expert, after $t$ time steps.
310: 
311: We admit $n\in\SetN\cup\{\infty\}$ experts, each of which is
312: assigned a known complexity $k^i\geq 0$. Usually we require
313: $\sum_i\e^{-k^i}\leq 1$, which implies that the $k^i$ are valid
314: lengths of prefix code words, for instance $k^i=\ln n$ if
315: $n<\infty$ or $k^i=\odt+2\ln i$ if $n=\infty$. Each complexity
316: defines a weight by means of $\smash{\e^{-k^i}}$ and vice versa.
317: In the following we will talk of complexities rather than of
318: weights. If $n$ is finite, then usually one sets $k^i= \ln n$ for
319: all $i$; this is the case of \emph{uniform complexities/weights}.
320: If the set of experts is countably infinite ($n=\infty$), uniform
321: complexities are not possible. The vector of all complexities is
322: denoted by $k=(k^i)_{1\leq i\leq n}$. At each time $t$, each
323: expert $i$ suffers a loss\footnote{The setup, analysis and results
324: easily scale to $s_t^i\in[0,S]$ for $S>0$ other than 1.}
325: $s_t^i=$Loss$(x_t,y_t^i)\in[0,1]$, and $s_t=(s_t^i)_{1\leq i\leq
326: n}$ is the vector of all losses at time $t$. Let
327: $s\ltt=s_1+\ldots+s_{t-1}$ (respectively $s\leqt=s_1+\ldots+s_t$)
328: be the total past loss vector (including current loss $s_t$) and
329: $s\leqt\smin=\min_i\{s\leqt^i\}$ be the loss of the \emph{best
330: expert in hindsight (BEH)}. Usually we do not know in advance the
331: time $t\geq 0$ at which the performance of our predictions is
332: evaluated.
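
As a quick sanity check of the two example choices of complexities
given above (a worked illustration only, not part of the formal
development): for uniform complexities $k^i=\ln n$ with $n<\infty$
we get $\sum_{i=1}^n\e^{-k^i}=n\cdot{1\over n}=1$, and for
$k^i=\odt+2\ln i$ with $n=\infty$ we get
\beqn
  \sum_{i=1}^\infty \e^{-k^i}
  = \e^{-1/2}\sum_{i=1}^\infty {1\over i^2}
  = \e^{-1/2}\cdot{\pi^2\over 6}
  \approx 0.6065\cdot 1.6449
  \approx 0.998 \leq 1.
\eeqn
Both choices therefore satisfy the constraint, the second one with
very little slack.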
333: 
334: %-------------------------------%
335: \paradot{General decision spaces}
336: %-------------------------------%
337: The setup can be generalized as follows. Let $\S\subset\SetR^n$ be the
338: \emph{state space} and $\D\subset\SetR^n$ the \emph{decision
339: space}. At time $t$ the state is $s_t\in\S$, and a decision
340: $d_t\in\D$ (which is made before the state is revealed) incurs a
341: loss $d_t\!\scp s_t$, where ``$\scp$'' denotes the inner product. This
342: implies that the loss function is \emph{linear} in the states.
343: Conversely, each linear loss function can be represented in this
344: way. The decision which minimizes the loss in state $s\in\S$ is
345: \beq\label{Mdef}
346:   M(s):=\arg\min_{d\in\D} \{d\scp s\}
347: \eeq
348: if the minimum exists. The application of this general framework
349: to PEA is straightforward: $\D$ is identified with the space of
350: all unit vectors $\E=\{e_i:1\leq i\leq n\}$, since a decision
351: consists of selecting a single expert, and $s_t\in[0,1]^n$, so
352: states are identified with losses. Only Theorems~\ref{thIFPL} and
353: \ref{thLowFPL} will be stated in terms of general decision space.
354: Our main focus is $\D=\E$. (Even for this special case, the scalar
355: product notation is not too heavy, but will turn out to be
356: convenient.) All our results generalize to the simplex
357: $\D=\Delta=\{v\in[0,1]^n:\sum_i v^i=1\}$, since the minimum of a
358: linear function on $\Delta$ is always attained on $\E$.
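
The last claim can be made explicit in one line (a short worked step
using only the definitions above). For a unit vector $d=e_i$ the
linear loss is just the loss of expert $i$, i.e.\ $e_i\scp s=s^i$.
For a mixture $v\in\Delta$, assuming that $\min_j\{s^j\}$ exists,
\beqn
  v\scp s \;=\; \sum_i v^i s^i
  \;\geq\; \sum_i v^i \min_j\{s^j\}
  \;=\; \min_j\{s^j\}
  \;=\; \min_{d\in\E}\{d\scp s\},
\eeqn
with equality for $v=e_j$ where $j$ attains the minimum. Hence
minimizing over $\Delta$ gives the same value as minimizing over
$\E$.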
359: 
360: %-------------------------------%
361: \paradot{Follow the Perturbed Leader}
362: %-------------------------------%
363: Given $s\ltt$ at time $t$, an immediate idea to solve the expert
364: problem is to ``Follow the Leader'' (FL), i.e.\ to select the
365: expert $e_i$ which performed best in the past (minimizes
366: $s\ltt^i$), that is, to predict according to expert $M(s\ltt)$. This
367: approach fails for two reasons. First, for $n=\infty$ the minimum
368: in (\ref{Mdef}) may not exist. Second, for $n=2$ and
369: $s={\,0\,1\,0\,1\,0\,1 \ldots \choose \frac{1}{2}0\,1\,0\,1\,0
370: \ldots}$, FL always chooses the wrong prediction \citep{Kalai:03}.
371: We solve the first problem by penalizing each expert by its
372: complexity, i.e.\ predicting according to expert $M(s\ltt+k)$. The
373: \emph{FPL (Follow the Perturbed Leader)} approach solves the
374: second problem by adding to each expert's loss $s\ltt^i$ a random
375: perturbation.
376: %
377: We choose this perturbation to be negative \emph{exponentially
378: distributed}, either independent in each time step or once and for
379: all at the very beginning at time $t=0$. The former choice is
380: preferable in order to protect against an adaptive adversary who
381: generates the $s_t$, and in order to get bounds with high
382: probability (Section~\ref{secMisc}). For the main analysis
383: however, the latter choice is more convenient. Due to linearity of
384: expectations, these two possibilities are equivalent when dealing
385: with {\it expected losses} (this is straightforward for oblivious
386: adversary, for adaptive adversary see Section~\ref{secAdap}), so
387: we can henceforth assume without loss of generality one initial
388: perturbation $q$.
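
The failure of FL on the two-expert sequence above can be traced
step by step (a worked illustration; the tie at $t=1$ is assumed to
be broken arbitrarily). The cumulative loss vectors evolve as
\beqn
  s_{<2}={0\choose 1/2},\quad
  s_{<3}={1\choose 1/2},\quad
  s_{<4}={1\choose 3/2},\quad
  s_{<5}={2\choose 3/2},\quad\ldots
\eeqn
so the leader is expert 1 at even $t$ and expert 2 at odd $t\geq 3$.
But $s_t^1=1$ exactly at even $t$ and $s_t^2=1$ exactly at odd
$t\geq 3$. Hence FL suffers loss 1 in every step $t\geq 2$, that is
at least $T-1$ in total, whereas each of the two experts has a
cumulative loss of about $T/2$.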
389: 
390: %-------------------------------%
391: \paranodot{The FPL algorithm} is defined as follows:\\
392: %-------------------------------%
393: %
394: \hspace*{1cm}Choose random vector $q\stackrel{d.}{\sim}\exp$,
395:              i.e.\ $P[q^1...q^n]=\e^{-q^1}\cdot...\cdot\e^{-q^n}$ for $q\geq 0$.\\
396: \hspace*{1cm}For $t=1,...,T$\\
397: \hspace*{1cm}- Choose learning rate $\eta_t$.\\
398: \hspace*{1cm}- Output prediction of expert $i$ which minimizes $s_{<t}^i+(k^i-q^i)/\eta_t$.\\
399: \hspace*{1cm}- Receive loss $s_t^i$ for all experts $i$.
400: 
401: \vspace{1.5ex}\noindent Other than $s\ltt$, $k$ and $q$, FPL
402: depends on the \emph{learning rate} $\eta_t$. We will give choices
403: for $\eta_t$ in Section~\ref{secBounds}, after having established
404: the main tools for the analysis. The expected loss at time $t$ of
405: FPL is $\ell_t:=E\big[M(s_{<t}+{k-q\over\eta_t})\scp s_t\big]$.
406: The key idea in the FPL analysis is the use of an intermediate
407: predictor \emph{IFPL} (for \emph{Implicit or Infeasible FPL}).
408: IFPL predicts according to $M(s\leqt+\smash{k-q\over\eta_t})$,
409: thus under the knowledge of $s_t$ (which is of course not
410: available in reality). By
411: $r_t:=E\big[M(s_{1:t}+\smash{k-q\over\eta_t})\scp s_t\big]$ we
412: denote the expected loss of IFPL at time $t$. The losses of IFPL
413: will be upper-bounded by BEH in Section~\ref{secIFPL} and
414: lower-bounded by FPL in Section~\ref{secFFPL}. Note that our
415: definition of the FPL algorithm deviates from that of
416: \citeauthor{Kalai:03}. It uses an exponentially distributed
417: perturbation similar to their FPL$^*$ but one-sided and a
418: non-stationary learning rate like Hannan's algorithm.
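
For intuition on how the perturbation acts, consider the following
two-expert special case (an illustration under simplifying
assumptions, not used in the later analysis): $n=2$, equal
complexities $k^1=k^2$, independent $q^1,q^2$ as in the algorithm,
a fixed learning rate $\eta_t=\eta$, and expert 1 currently behind
by $\delta:=s_{<t}^1-s_{<t}^2\geq 0$ (the symbol $\delta$ is
introduced only for this example). FPL selects expert 1 if and only
if $q^1-q^2>\eta\delta$. Integrating over the value $y$ of $q^2$,
\beqn
  P[q^1-q^2>x]
  = \int_0^\infty \e^{-y}\,P[q^1>x+y]\,dy
  = \int_0^\infty \e^{-y}\,\e^{-x-y}\,dy
  = \odt\e^{-x}
  \qmbox{for} x\geq 0,
\eeqn
so expert 1 is selected with probability $\odt\e^{-\eta\delta}$ and
\beqn
  \ell_t = \odt\e^{-\eta\delta}\,s_t^1
         + \big(1-\odt\e^{-\eta\delta}\big)\,s_t^2.
\eeqn
For $\eta\to\infty$ and $\delta>0$ the trailing expert is never
chosen, which recovers FL, while for $\eta\to 0$ both experts are
chosen with probability $\odt$ regardless of their past losses.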
419: 
420: %-------------------------------%
421: \paradot{Notes}
422: %-------------------------------%
423: Observe that we have stated the FPL algorithm regardless of the
424: actual \emph{predictions} of the experts and possible
425: \emph{observations}, only the \emph{losses} are relevant.
426: %
427: Note also that an expert can implement a highly complicated strategy
428: depending on past outcomes, despite its trivializing
429: identification with a constant unit vector. The complex expert's
430: (and environment's) behavior is summarized and hidden in the state
431: vector $s_t=$Loss$(x_t,y_t^i)_{1\leq i\leq n}$.
432: %
433: Our results therefore apply to \emph{arbitrary prediction and
434: observation spaces $\Y$ and $\X$ and arbitrary bounded loss
435: functions}.
436: This is in contrast to the major part of PEA work
437: developed for binary alphabet and 0/1 or absolute loss only.
438: %
439: Finally note that the setup allows for losses generated by an
440: adversary who tries to maximize the regret of FPL and knows the
441: FPL algorithm and all experts' past predictions/losses. If the
442: adversary also has access to FPL's past decisions, then FPL must
443: use independent randomization at each time step in order to
444: achieve good regret bounds.
445: 
446: %-------------------------------%
447: \paradot{Motivation of FPL}
448: %-------------------------------%
449: Let $d(s_{<t})$ be any predictor with decision based on $s_{<t}$.
450: The following identity is easy to show:
451: \beq\label{eqFId}
452:   \underbrace{\sum_{t=1}^T d(s_{<t})\scp s_t}_{\text{``FPL''}}
453:   \;\equiv\;
454:   \underbrace{_{\rule{0ex}{3.8ex}}d(s_{1:T})\scp s_{1:T}}_{\text{``BEH''}}
455:   + \overbrace{\underbrace{\sum_{t=1}^T [d(s_{<t})\!-\!d(s_{1:t})]\scp s_{<t}}_{\text{``IFPL}-\text{BEH''}}}^{\text{$\leq 0$ if $d\approx M$}}
456:   + \overbrace{\underbrace{\sum_{t=1}^T [d(s_{<t})\!-\!d(s_{1:t})]\scp s_t}_{\text{``FPL}-\text{IFPL''}}}^{\text{small if $d(\cdot)$ is continuous}}
457: \eeq
458: For a good bound of FPL in terms of BEH we need the first term on
459: the r.h.s.\ to be close to BEH and the last two terms to be small.
460: The first term is close to BEH if $d\approx M$. The second to last
461: term is even negative if $d=M$, hence small if $d\approx M$. The
462: last term is small if $d(s_{<t})\approx d(s_{1:t})$, which is the
463: case if $d(\cdot)$ is a sufficiently smooth function.
464: Randomization smooths the discontinuous function $M$: The
465: function $d(s):=E[M(s-q)]$, where $q\in\SetR^n$ is some random
466: perturbation, is a continuous function in $s$. If the mean and
467: variance of $q$ are small, then $d\approx M$; if the variance of
468: $q$ is large, then $d(s_{<t})\approx d(s_{1:t})$. An intermediate
469: variance makes the last two terms of (\ref{eqFId}) simultaneously
470: small enough, leading to excellent bounds for FPL.
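
The identity (\ref{eqFId}) can be verified by a telescoping
argument, which we spell out here for convenience. Since
$s_{<t}+s_t=s_{1:t}$, the last two sums on the r.h.s.\ of
(\ref{eqFId}) combine to
\bqan
  \sum_{t=1}^T [d(s_{<t})-d(s_{1:t})]\scp s_{1:t}
  &=& \sum_{t=1}^T d(s_{<t})\scp s_t
    + \sum_{t=1}^T \big[d(s_{<t})\scp s_{<t}-d(s_{1:t})\scp s_{1:t}\big]
\\
  &=& \sum_{t=1}^T d(s_{<t})\scp s_t \;-\; d(s_{1:T})\scp s_{1:T},
\eqan
where the second sum telescopes because $s_{1:t}=s_{<t+1}$ and the
empty sum $s_{<1}$ is zero. Adding the first term
$d(s_{1:T})\scp s_{1:T}$ of the r.h.s.\ yields the l.h.s.\ of
(\ref{eqFId}).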
471: 
472: %-------------------------------%
473: \paradot{List of notation}\hfill\\
474: %-------------------------------%
475: $n\in\SetN\cup\{\infty\}$ ($n=\infty$ means countably infinite $\E$).\\
476: $x^i$ is $i$th component of vector $x\in\SetR^n$.\\
477: $\E:=\{e_i:1\leq i\leq n\}=$ set of unit vectors ($e_i^j=\delta_{ij}$).\\
478: $\Delta:=\{v\in[0,1]^n:\sum_i v^i=1\}$= simplex.\\
479: $s_t\in[0,1]^n$= environmental state/loss vector at time $t$.\\
480: $s_{1:t}:=s_1+...+s_t$= state/loss (similar for $\ell_t$ and $r_t$).\\
481: $s_{1:T}^{min}=\min_i\{s_{1:T}^i\}$= loss of Best Expert in Hindsight (BEH).\\
482: $s_{<t}:=s_1+...+s_{t-1}$= state/loss summary ($s_{<1}=0$).\\
483: $M(s):=\arg\min_{d\in\D}\{d\scp s\}$= best decision on $s$.\\
484: $T\in\SetN_0$= total time=step, $t\in\SetN$= current time=step.\\
485: $k^i\geq 0$= penalization = complexity of expert $i$.\\
486: $q\in\SetR^n$= random vector with independent exponentially distributed components.\\
487: $I_t:=\arg\min_{i\in\E}\{s_{<t}^i+{k^i-q^i\over\eta_t}\}$= randomized prediction of FPL.\\
488: $\ell_t:=E[M(s_{<t}+{k-q\over\eta_t})\scp s_t]$= expected loss at time $t$ of FPL (=$E[s_t^{I_t}]$ for $\D=\E$).\\
489: $r_t:=E[M(s_{1:t}+{k-q\over\eta_t})\scp s_t]$= expected loss at time $t$ of IFPL. \\
490: $u_t:=M(s_{<t}+{k-q\over\eta_t})\scp s_t$= actual loss at time $t$ of FPL (=$s_t^{I_t}$ for $\D=\E$).\\
491: 
492: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
493: \section{IFPL bounded by Best Expert in Hindsight}\label{secExpMax}\label{secIFPL}
494: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
495: 
496: In this section we provide tools for comparing the loss of IFPL
497: to the loss of the best expert in hindsight. The first result
498: bounds the expected error induced by the exponentially distributed
499: perturbation.
500: 
501: \begin{lemma}[Maximum of Shifted Exponential Distributions]\label{lemExpMax}
502: Let $q^1,...,q^n$ be (not necessarily independent) exponentially
503: distributed random variables, i.e.\ $P[q^i]=\e^{-q^i}$ for
504: $q^i\geq 0$ and $1\leq i\leq n\leq\infty$, and $k^i\in\SetR$ be
505: real numbers with $u:=\sum_{i=1}^n\e^{-k^i}$. Then
506: \bqan
507:   P[\max_i\{q^i-k^i\}\geq a]
508:   &=& 1-\prod_{i=1}^n \max\{0,1\!-\!\e^{-a-k^i}\}
509:   \qmbox{if} q^1,...,q^n \;\mbox{are independent,}
510: \\
511:   P[\max_i\{q^i-k^i\}\geq a]
512:   &\leq& \min\{1,u\,\e^{-a}\},
513: \\
514:   E[\max_i\{q^i-k^i\}] &\leq& 1+\ln u.
515: \eqan
516: \end{lemma}
517: 
518: \paradot{Proof} Using
519: \beqn
520:   P[q^i<a] = \max\{0,1\!-\!\e^{-a}\}\geq 1-\e^{-a}
521:   \qmbox{and}
522:   P[q^i\geq a] = \min\{1,\e^{-a}\}\leq \e^{-a},
523: \eeqn
524: valid for any $a\in\SetR$, the exact expression for $P[\max]$ in
525: Lemma~\ref{lemExpMax} follows from
526: \beqn
527:   P[\max_i\{q^i-k^i\}<a]
528:   = P[q^i-k^i<a\;\forall i]
529:   = \prod_{i=1}^n P[q^i<a+k^i]
530:   = \prod_{i=1}^n \max\{0,1\!-\!\e^{-a-k^i}\}
531: \eeqn
532: where the second equality follows from the independence of the
533: $q^i$. The bound on $P[\max]$ for any $a\in\SetR$ (including negative $a$) follows
534: from
535: \beqn
536:   P[\max_i\{q^i-k^i\}\geq a]
537:   = P[\exists i:q^i-k^i\geq a]
538:   \leq \sum_{i=1}^n P[q^i-k^i\geq a]
539:   \leq \sum_{i=1}^n \e^{-a-k^i} = u\!\cdot\!\e^{-a}
540: \eeqn
541: where the first inequality is the union bound.
542: Using $E[z]\leq E[\max\{0,z\}]=\int_0^\infty P[\max\{0,z\}\geq
543: y]dy = \int_0^\infty P[z\geq y]dy$ (valid for any real-valued
544: random variable $z$) for $z=\max_i\{q^i-k^i\}-\ln u$, this implies
545: \beqn
546:   E[\max_i\{q^i-k^i\}-\ln u]
547:   \leq \int_0^\infty P\big[\max_i\{q^i-k^i\}\geq y+\ln u \big]dy
548:   \leq \int_0^\infty \e^{-y} dy\ = \ 1,
549: \eeqn
550: which proves the bound on $E[\max]$.
551: \qed
552: 
553: If $n$ is finite, a lower bound $E[\max_i q^i]\geq 0.57721+\ln n$
554: can be derived, showing that the upper bound on $E[\max]$ is quite
555: tight (at least) for $k^i=0$ $\forall i$.
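
To quantify this tightness in the simplest case (a worked
illustration assuming independent $q^i$ and $k^i=0$ for all $i$, so
that $u=n$): the maximum of $n$ independent unit-rate exponential
random variables has expectation equal to the harmonic number
$H_n:=\sum_{j=1}^n{1\over j}$ (notation introduced only here), and
\beqn
  \gamma+\ln n \;\leq\; H_n = E[\max_i q^i] \;\leq\; 1+\ln n,
  \qmbox{where} \gamma\approx 0.57721
\eeqn
is Euler's constant. For instance, for $n=10$ we have
$H_{10}\approx 2.929$, between the lower bound
$\gamma+\ln 10\approx 2.880$ and the upper bound
$1+\ln 10\approx 3.303$ of Lemma~\ref{lemExpMax}. In this case the
upper bound therefore overestimates the true expectation by less
than $1-\gamma<0.43$ for every $n$.
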
556: %
557: The following bound generalizes \citep[Lem.3]{Kalai:03} to
558: arbitrary weights, establishing a relation between IFPL and
559: the best expert in hindsight.
560: 
561: \begin{theorem}[IFPL bounded by BEH]\label{thIFPL}
562: Let $\D\subseteq\SetR^n$, $s_t\in\SetR^n$ for $1\leq t\leq T$
563: (both $\D$ and $s$ may even have negative components, but we assume that all
564: required extrema are attained), and $q,k\in\SetR^n$. If
565: $\eta_t>0$ is decreasing in $t$, then the loss of the infeasible FPL
566: knowing $s_t$ at time $t$ in advance (l.h.s.) can be bounded in
567: terms of the best predictor in hindsight (first term on r.h.s.) plus
568: additive corrections:
569: \beqn
570:   \sum_{t=1}^T M(s_{1:t}+{k\!-\!q\over\eta_t})\scp s_t
571:   \leq \min_{d\in\D}\{d\scp(s_{1:T}+{k\over\eta_T})\}
572:      + {1\over\eta_T}\max_{d\in\D}\{d\scp(q-k)\}
573:      - {1\over\eta_T} M(s_{1:T}+{k\over\eta_T})\scp q.
574: \eeqn
575: \end{theorem}
576: 
577: Note that if $\D=\E$ (or $\D=\Delta$) and $s_t\geq 0$, then
578: all extrema in the theorem are attained almost surely. The
579: same holds for all subsequent extrema in the proof and
580: throughout the paper.
581: 
582: \paradot{Proof} For notational convenience, let $\eta_0=\infty$ and
583: $\tilde s\leqt=s\leqt+\frac{k-q}{\eta_t}$. Consider the losses
584: $\tilde s_t=s_t+(k-q)\big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\big)$
585: for the moment. We first show by induction on $T$ that the infeasible
586: predictor $M(\tilde s_{1:t})$ has zero regret for any loss $\tilde
587: s$, i.e.\
588: \beq\label{eqnoregret}
589:   \sum_{t=1}^T M(\tilde s_{1:t})\scp \tilde s_t \leq M(\tilde s_{1:T})\scp \tilde s_{1:T}.
590: \eeq
591: For $T=1$ this is obvious. For the induction step from $T-1$ to $T$
592: we need to show
593: \beq\label{eq:noregret1}
594:   M(\tilde s_{1:T})\scp \tilde s_T \leq M(\tilde s_{1:T})\scp
595:   \tilde s_{1:T} - M(\tilde s_{<T})\scp \tilde s_{<T}.
596: \eeq
597: This follows from $\tilde s_{1:T}=\tilde s_{<T}+\tilde s_T$ and
598: $M(\tilde s_{1:T})\scp \tilde s_{<T} \geq M(\tilde s_{<T})\scp
599: \tilde s_{<T}$ by minimality of $M$.
600: Rearranging terms in (\ref{eqnoregret}), we obtain
601: \beq\label{eqifpl2}
602:   \sum_{t=1}^T M(\tilde s_{1:t})\scp s_t
603:   \ \leq\
604:   M(\tilde s_{1:T})\scp \tilde s_{1:T}- \sum_{t=1}^T M(\tilde s_{1:t})\scp
605:   (k-q)\Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)
606: \eeq
607: Moreover, by minimality of $M$,
608: \bqa
609: \label{eqifpl4}
610: M(\tilde s_{1:T})\scp \tilde s_{1:T} & \leq &
611: M\Big(s_{1:T}+\frac{k}{\eta_T}\Big)\scp
612: \Big(s_{1:T}+\frac{k-q}{\eta_T}\Big)\\
613: \nonumber
614: & = & \min_{d\in\D}\left\{d\scp(s_{1:T}+{k\over\eta_T})\right\}-
615: M\Big(s_{1:T}+\frac{k}{\eta_T}\Big)\scp
616: \frac{q}{\eta_T}
617: \eqa
618: holds. Using ${1\over\eta_t}-{1\over\eta_{t-1}}\geq 0$ and again
619: minimality of $M$, we have
620: \bqa\label{eqifpl3}
621: \sum_{t=1}^T
622: ({1\over\eta_t}-{1\over\eta_{t-1}})M(\tilde s_{1:t})\scp(q-k) &
623: \leq & \sum_{t=1}^T
624: ({1\over\eta_t}-{1\over\eta_{t-1}})M(k-q)\scp(q-k)\\
625: \nonumber
626: &  = & {1\over\eta_T}M(k-q)\scp(q-k)
627: = {1\over\eta_T}\max_{d\in\D}\{d\scp(q-k)\}
628: \eqa
629: Inserting (\ref{eqifpl4}) and (\ref{eqifpl3}) back into (\ref{eqifpl2})
630: we obtain the assertion.
631: \qed
632: 
633: Assuming $q$ random with $E[q^i]=1$ and taking the expectation in
634: Theorem~\ref{thIFPL}, the last term reduces to
635: $-{1\over\eta_T}\sum_{i=1}^n M(s_{1:T}+{k\over\eta_T})^i$.
636: If $\D\geq 0$, the term is negative and may be dropped. In case of
637: $\D=\E$ or $\Delta$, the last term is identical to
638: $-{1\over\eta_T}$ (since $\sum_i d^i=1$) and keeping it improves
639: the bound.
640: %
641: Furthermore, we need to evaluate the expectation of the second to
642: last term in Theorem~\ref{thIFPL}, namely
643: $E[\max_{d\in\D}\{d\scp(q-k)\}]$. For $\D=\E$ and $q$ being
644: exponentially distributed, using Lemma~\ref{lemExpMax}, the
645: expectation is bounded by $1+\ln u$. We hence get the following
646: bound:
647: 
648: \begin{corollary}[IFPL bounded by BEH]\label{corIFPL}
649: For $\D=\E$ and $\sum_i \e^{-k^i}\leq 1$ and
650: $P[q^i]=\e^{-q^i}$ for $q\geq 0$ and decreasing $\eta_t>0$, the
651: expected loss of the infeasible FPL exceeds the loss of expert $i$
652: by at most $k^i/\eta_T$:
653: \beqn
654:   r_{1:T} \;\leq\; s_{1:T}^i + {1\over\eta_T}k^i  \quad\forall i.
655: \eeqn
656: \end{corollary}
657: 
658: Theorem~\ref{thIFPL} can be generalized to expert-dependent
659: factorizable $\eta_t\leadsto \eta_t^i=\eta_t\cdot\eta^i$
660: by scaling $k^i\leadsto k^i/\eta^i$ and $q^i\leadsto q^i/\eta^i$.
661: Using $E[\max_i\{{q^i-k^i\over\eta^i}\}]\leq
662: E[\max_i\{q^i-k^i\}]/\min_i\{\eta^i\}$, Corollary~\ref{corIFPL}
663: generalizes to
664: \beqn
665:     E[\sum_{t=1}^T M(s_{1:t}+{k-q\over\eta_t^i})\scp s_t]
666:     \;\leq\; s_{1:T}^i + {1\over\eta_T^i}k^i + {1\over\eta_T^{min}}
667:     \quad\forall i,
668: \eeqn
669: where $\eta_T^{min}:=\min_i\{\eta_T^i\}$.
670: For example, for $\eta_t^i=\sqrt{k^i/t}$
671: we get the desired bound $s_{1:T}^i+\sqrt{T\cdot(k^i+4)}$.
672: Unfortunately we were not able to generalize Theorem~\ref{thFIFPL}
673: to expert-dependent $\eta$, necessary for the final bound on FPL.
674: In Section~\ref{secHierarchy} we solve this problem by a hierarchy
675: of experts.
676: 
677: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
678: \section{Feasible FPL bounded by Infeasible FPL}\label{secFFPL}
679: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
680: 
681: This section establishes the relation between the FPL and IFPL
682: losses. Recall that $\ell_t=E\big[M(s_{<t}+{k-q\over\eta_t})\scp
683: s_t\big]$ is the expected loss of FPL at time $t$ and
684: $r_t=E\big[M(s_{1:t}+{k-q\over\eta_t})\scp s_t\big]$ is the
685: expected loss of IFPL at time $t$.
686: 
687: \begin{theorem}[FPL bounded by IFPL]\label{thFIFPL}
688: For $\D=\E$ and $0\leq s_t^i\leq 1$ $\forall i$ and arbitrary
689: $s_{<t}$ and $P[q]=\e^{-\sum_i q^i}$ for $q\geq 0$, the expected
690: loss of the feasible FPL is at most a factor $\e^{\eta_t}>1$
691: larger than for the infeasible FPL:
692: \beqn
693:   \ell_t\leq \e^{\eta_t}r_t, \qmbox{which implies}
694:   \ell_{1:T}-r_{1:T}\leq \sum_{t=1}^T\eta_t \ell_t.
695: \eeqn
696: Furthermore, if $\eta_t\leq 1$, then also $\ell_t\leq
697: (1+\eta_t+\eta_t^2)r_t\leq (1+2\eta_t)r_t$.
698: \end{theorem}
699: 
700: \paradot{Proof}
701: Let $s=s_{<t}+\sooe k$ be the past cumulative penalized state
702: vector, $q$ be a vector of independent exponentially distributed random variables,
703: i.e.\ $P[q^i]=\e^{-q^i}$, and $\eta=\eta_t$.
704: Then, for every index $j$ and every real number $m$,
705: \beqn
706:   {P[q^j\geq \eta(s^j-m+1)]\over P[q^j\geq\eta(s^j-m)]}
707:   = \left\{%
708: \begin{array}{ccc}
709:   \e^{-\eta}        & \mbox{if} & s^j\geq m \\
710:   \e^{-\eta(s^j-m+1)} & \mbox{if} & m-1\leq s^j\leq m \\
711:   1                  & \mbox{if} & s^j\leq m-1 \\
712: \end{array}%
713: \right\} \geq \e^{-\eta}
714: \eeqn
715: We now define the random variables $I:=\arg\min_i\{s^i-\sooe q^i\}$ and
716: $J:=\arg\min_i\{s^i+s_t^i-\sooe q^i\}$, where $0\leq s_t^i\leq 1$
717: $\forall i$. Furthermore, for fixed vector $x\in\SetR^n$ and fixed
718: $j$ we define $m:=\min_{i\neq j}\{s^i-\sooe x^i\}\leq \min_{i\neq
719: j}\{s^i+s_t^i-\sooe x^i\}=:m'$.
720: With this notation and using the independence of $q^j$ from $q^i$
721: for all $i\neq j$, we get
722: \beqn
723:   P[I=j|q^i=x^i\,\forall i\neq j]
724:   \;=\; P[s^j-\sooe q^j\leq m|q^i=x^i\,\forall i\neq j]
725:   \;=\; P[q^j\geq\eta(s^j-m)]
726: \eeqn
727: \beqn
728:   \;\leq\; \e^\eta P[q^j\geq\eta(s^j-m+1)]
729:   \;\leq\; \e^\eta P[q^j\geq\eta(s^j+s_t^j-m')]
730: \eeqn
731: \beqn
732:   \;=\; \e^\eta P[s^j+s_t^j-\sooe q^j\leq m'|q^i=x^i\,\forall i\neq j]
733:   \;=\; \e^\eta P[J=j|q^i=x^i\,\forall i\neq j]
734: \eeqn
735: Since this bound holds under any condition $x$, it also holds
736: unconditionally, i.e.\ $P[I=j]\leq \e^\eta P[J=j]$. For
737: $\D=\E$ we have $s_t^I=M(s_{<t}+{k-q\over\eta})\scp s_t$ and
738: $s_t^J=M(s_{1:t}+{k-q\over\eta})\scp s_t$, which implies
739: \beqn
740:   \ell_t
741:   \;=\;E[s_t^I]
742:   \;=\; \sum_{j=1}^n s_t^j\!\cdot\!P[I=j]
743:   \;\leq\; \e^\eta \sum_{j=1}^n s_t^j\!\cdot\!P[J=j]
744:   \;=\; \e^\eta E[s_t^J]
745:   \;=\; \e^\eta r_t.
746: \eeqn
747: Finally, $\ell_t-r_t\leq\eta_t\ell_t$ follows from $r_t\geq
748: \e^{-\eta_t}\ell_t\geq (1-\eta_t)\ell_t$, and $\ell_t\leq
749: \e^{\eta_t}r_t\leq (1+\eta_t+\eta_t^2)r_t\leq (1+2\eta_t)r_t$ for
750: $\eta_t\leq 1$ is elementary.
751: \qed
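
The elementary inequalities used in the last step of the proof can be
verified directly. The following is a short sketch; the numerical value
$\eta=0.1$ is chosen purely for illustration. For $0\leq\eta\leq 1$ we have
$\eta^j\leq\eta^2$ for all $j\geq 2$ and $\eta^2\leq\eta$, hence
\beqn
  \e^{\eta}
  \;=\; 1+\eta+\sum_{j\geq 2}{\eta^j\over j!}
  \;\leq\; 1+\eta+\eta^2\sum_{j\geq 2}{1\over j!}
  \;=\; 1+\eta+(\e-2)\eta^2
  \;\leq\; 1+\eta+\eta^2
  \;\leq\; 1+2\eta.
\eeqn
Moreover, $\e^{-\eta}\geq 1-\eta$ holds for all real $\eta$ by convexity of
the exponential function. For instance, $\eta=0.1$ gives
$\e^{0.1}\approx 1.105\leq 1.11\leq 1.2$.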
752: 
753: \paradot{Remark}
754: As done by \citet{Kalai:03}, one can prove a similar statement
755: for general decision space $\D$ as long as
756: $\sum_i|s_t^i|\leq A$ is guaranteed for some $A>0$: In this
757: case, we have $\ell_t\leq \e^{\eta_t A}r_t$. If $n$ is finite,
758: then the bound holds for $A=n$. For $n=\infty$, the assertion
759: holds under the somewhat unnatural assumption that $\S$ is
760: $l^1$-bounded.
761: 
762: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
763: \section{\boldmath Combination of Bounds and Choices for $\eta_t$}\label{secBounds}
764: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
765: 
766: Throughout this section, we assume
767: \beq
768: \label{eq:Assumptions}
769:   \D=\E,\quad s_t\in[0,1]^n\ \forall t,\quad
770:   P[q]=\e^{-\sum_i q^i} \;\mbox{for}\; q\geq 0,\ \qmbox{and}
771:   \sum_i \e^{-k^i}\leq 1.
772: \eeq
773: We distinguish \emph{static} and \emph{dynamic} bounds. Static
774: bounds refer to a constant $\eta_t\equiv\eta$. Since this value
775: has to be chosen in advance, a static choice of $\eta_t$ requires
776: certain prior information and therefore is not practical in many
777: cases. However, the static bounds are very easy to derive, and
778: they provide a good means to compare different PEA algorithms. If,
779: on the other hand, the algorithm is to be applied without
780: appropriate prior knowledge, a dynamic choice of $\eta_t$, depending
781: only on $t$ and/or past observations, is necessary.
782: 
783: \begin{theorem}[FPL bound for static $\eta_t=\eta\propto 1/\sqrt{L}$]\label{thFPLStatic}
784: Assume (\ref{eq:Assumptions}) holds, then the expected loss
785: $\ell_t$ of feasible FPL, which employs the prediction of the
786: expert $i$ minimizing $s_{<t}^i+{k^i-q^i\over\eta_t}$, is bounded
787: by the loss of the best expert in hindsight in the following way:
788: \bqan
789:   i) & & \nq
790:   \mbox{For}\quad \eta_t=\eta=1/\sqrt{L}
791:   \qmbox{with} L\geq\ell_{1:T}
792:   \qmbox{we have}
793: \\
794:      & & \nq
795:   \ell_{1:T}
796:   \;\leq\; s_{1:T}^i + \sqrt{L}(k^i+1) \quad\forall i
797: \\
798:   ii) & & \nq
799:   \mbox{For}\quad \eta_t=\sqrt{K/L}
800:   \qmbox{with} L\geq\ell_{1:T}
801:   \qmbox{and} k^i\leq K \;\forall i
802:   \qmbox{we have}
803: \\
804:     & & \nq
805:   \ell_{1:T}
806:   \;\leq\; s_{1:T}^i + 2\sqrt{LK} \quad\forall i
807: \\
808:   iii) & & \nq
809:   \mbox{For}\quad \eta_t=\sqrt{k^i/L}
810:   \qmbox{with} L\geq \max\{s_{1:T}^i,k^i\}
811:   \qmbox{we have}
812: \\
813:     & & \nq
814:   \ell_{1:T}
815:   \;\leq\; s_{1:T}^i + 2\sqrt{Lk^i}+3k^i
816: \eqan
817: \end{theorem}
818: 
819: Note that according to assertion $(iii)$, knowledge of only the
820: \emph{ratio} of the complexity and the loss of the best
821: expert is sufficient in order to obtain good static bounds, even
822: for non-uniform complexities.
823: 
824: \paradot{Proof} $(i,ii)$ For $\eta_t=\sqrt{K/L}$ and $L\geq\ell_{1:T}$,
825: from Theorem~\ref{thFIFPL} and Corollary
826: \ref{corIFPL}, we get
827: \beqn
828:   \ell_{1:T}-r_{1:T}
829:   \leq \sum_{t=1}^T\eta_t\ell_t
830:   = \ell_{1:T}\sqrt{K/L}\leq\sqrt{LK}
831:   \qmbox{and}
832:   r_{1:T}-s_{1:T}^i
833:   \leq k^i/\eta_T=k^i\sqrt{L/K}
834: \eeqn
835: Combining both, we get
836: $\ell_{1:T}-s_{1:T}^i\leq\sqrt{L}(\sqrt{K}+k^i/\sqrt{K})$.
837: $(i)$ follows from $K=1$ and $(ii)$ from $k^i\leq K$.
838: 
839: \noindent
840: $(iii)$ For $\eta=\sqrt{k^i/L}\leq 1$ we get
841: \bqan
842:   \ell_{1:T}
843:   & \leq & \e^\eta r_{1:T}
844:   \leq (1+\eta+\eta^2)r_{1:T}
845:   \leq (1+\sqrt{k^i\over L}+{k^i\over L})(s_{1:T}^i+\sqrt{L\over
846:   k^i}k^i)\\
847:   & \leq & s_{1:T}^i+\sqrt{Lk^i} +(\sqrt{k^i\over L}+{k^i\over L})(L+\sqrt{Lk^i})
848:   = s_{1:T}^i + 2\sqrt{Lk^i} +(2+\sqrt{k^i\over L})k^i
849: \eqan
850: \qed
851: 
852: The static bounds require knowledge of an upper bound $L$ on the
853: loss (or the ratio of the complexity of the best expert and its
854: loss). Since the instantaneous loss is bounded by $1$, one may set
855: $L=T$ if $T$ is known in advance. For finite $n$ and $k^i=K=\ln
856: n$, bound $(ii)$ gives the classic regret $\propto\sqrt{T\ln
857: n}$. If neither $T$ nor $L$ is known, a dynamic choice of $\eta_t$
858: is necessary. We first present bounds with regret $\propto\sqrt{T}$,
859: thereafter with regret $\propto\sqrt{s_{1:T}^i}$.
860: 
861: \begin{theorem}[FPL bound for dynamic $\eta_t\propto 1/\sqrt{t}$]\label{thFPLTDynamic}
862: Assume (\ref{eq:Assumptions}) holds.
863: \bqan
864:   i) & & \nq
865:   \mbox{For}\quad \eta_t=1/\sqrt{t}
866:   \qmbox{we have}
867:   \ell_{1:T} \;\leq\; s_{1:T}^i + \sqrt{T}(k^i+2) \quad\forall i
868: \\
869:   ii) & & \nq
870:   \mbox{For}\quad \eta_t=\sqrt{K/2t}
871:   \;\;\mbox{and}\;\; k^i\leq K \;\forall i
872:   \;\;\mbox{we have}\;\;
873:   \ell_{1:T} \;\leq\; s_{1:T}^i + 2\sqrt{2TK}
874:   \quad\forall i
875: \eqan
876: \end{theorem}
877: 
878: \paradot{Proof} For $\eta_t=\sqrt{K/2t}$, using
879: $\sum_{t=1}^T{1\over\sqrt{t}}\leq\int_0^T{dt\over\sqrt{t}}=
880: 2\sqrt{T}$ and $\ell_t\leq 1$ we get
881: \beqn
882:   \ell_{1:T}-r_{1:T}
883:   \leq \sum_{t=1}^T \eta_t
884:   \leq \sqrt{2TK}
885:   \qmbox{and}
886:   r_{1:T}-s_{1:T}^i
887:   \leq {k^i/\eta_T}=k^i\sqrt{2T\over K}
888: \eeqn
889: Combining both, we get
890: $\ell_{1:T}-s_{1:T}^i \leq \sqrt{2T}(\sqrt{K}+k^i/\sqrt{K})$.
891: $(i)$ follows from $K=2$ and $(ii)$ from $k^i\leq K$.
892: \qed
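
To get a feeling for the price of not knowing $T$, one can compare
Theorem~\ref{thFPLStatic}$(ii)$ with $L=T$ to
Theorem~\ref{thFPLTDynamic}$(ii)$. The values $n=100$ and $T=10^4$ used
below are illustrative assumptions, with uniform complexities
$k^i=K=\ln n$ (so that $\sum_i\e^{-k^i}=1$):
\beqn
  2\sqrt{TK} \;=\; 2\sqrt{10^4\ln 100} \;\approx\; 429
  \qmbox{versus}
  2\sqrt{2TK} \;=\; \sqrt{2}\cdot 2\sqrt{TK} \;\approx\; 607.
\eeqn
In this example the dynamic choice $\eta_t=\sqrt{K/2t}$ loses a factor
$\sqrt{2}$ in the regret bound relative to the static choice, and both
regret bounds stay below $7\%$ of the trivial bound $T$ on the total loss.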
893: 
894: In Theorem~\ref{thFPLStatic} we assumed knowledge of an
895: upper bound $L$ on $\ell_{1:T}$. In an adaptive form,
896: $L_t:=\ell_{<t}+1$, known at the beginning of time $t$, could be used
897: as an upper bound on $\ell_{1:t}$ with corresponding adaptive
898: $\eta_t\propto 1/\sqrt{L_t}$. Such a choice of $\eta_t$ is also
899: called \emph{self-confident} \citep{Auer:02pea}.
900: 
901: \begin{theorem}[FPL bound for self-confident $\eta_t\propto 1/\sqrt{\ell_{<t}}$]\label{thFPLLDynamic}
902: Assume (\ref{eq:Assumptions}) holds.
903: \bqan
904:   i) & & \nq
905:   \mbox{For}\quad \eta_t=1/\sqrt{2(\ell_{<t}+1)}
906:   \qmbox{we have}
907: \\
908:    & & \nq
909:   \ell_{1:T}
910:   \;\leq\; s_{1:T}^i + (k^i\!+\!1)\sqrt{2(s_{1:T}^i\!+\!1)} + 2(k^i\!+\!1)^2
911:   \quad\forall i
912: \\
913:   ii) & & \nq
914:   \mbox{For}\quad \eta_t=\sqrt{K/2(\ell_{<t}+1)}
915:   \qmbox{and} k^i\leq K \;\forall i
916:   \qmbox{we have}
917: \\
918:     & & \nq
919:   \ell_{1:T}
920:   \;\leq\; s_{1:T}^i + 2\sqrt{2(s_{1:T}^i\!+\!1)K} + 8K
921:   \quad\forall i
922: \eqan
923: \end{theorem}
924: 
925: \paradot{Proof} Using
926: $\eta_t=\sqrt{K/2(\ell_{<t}+1)}\leq\sqrt{K/2\ell_{1:t}}$ and
927: ${b-a\over\sqrt
928: b}=(\sqrt{b}-\sqrt{a})(\sqrt{b}+\sqrt{a}){1\over\sqrt{b}}\leq
929: 2(\sqrt{b}-\sqrt{a})$ for $a\leq b$ and $t_0:=\min\{t:\ell_{1:t}>0\}$ we get
930: \beqn\label{eqLD}
931:   \ell_{1:T}\!-\!r_{1:T}
932:   \leq \sum_{t=t_0}^T \eta_t\ell_t
933:   \leq \sqrt{K\over 2}\sum_{t=t_0}^T {\ell_{1:t}\!-\!\ell_{<t}\over\sqrt{\ell_{1:t}}}
934:   \leq \sqrt{2K}\sum_{t=t_0}^T [\sqrt{\ell_{1:t\!\!}}\,-\!\sqrt{\ell_{<t\!\!}}\;]
935:   = \sqrt{2K}\sqrt{\ell_{1:T}}
936: \eeqn
937: Adding
938: $r_{1:T}-s_{1:T}^i \leq {k^i\over\eta_T} \leq
939: k^i\sqrt{2(\ell_{1:T}+1)/K}$ we get
940: \beqn
941:   \ell_{1:T}-s_{1:T}^i
942:   \leq \sqrt{2\bar\kappa^i(\ell_{1:T}\!+\!1)},
943:   \qmbox{where}
944:   \sqrt{\bar\kappa^i}:=\sqrt{K}+k^i/\sqrt{K}.
945: \eeqn
946: Taking the square and solving the resulting quadratic inequality
947: w.r.t.\ $\ell_{1:T}$ we get
948: \beqn
949:   \ell_{1:T}
950:   \leq s_{1:T}^i + \bar\kappa^i + \sqrt{2(s_{1:T}^i\!+\!1)\bar\kappa^i+(\bar\kappa^i)^2}
951:   \leq s_{1:T}^i + \sqrt{2(s_{1:T}^i\!+\!1)\bar\kappa^i} + 2\bar\kappa^i
952: \eeqn
953: For $K=1$ we get $\sqrt{\bar\kappa^i}=k^i+1$ which yields $(i)$.
954: For $k^i\leq K$ we get $\bar\kappa^i\leq 4K$ which yields $(ii)$.
955: \qed
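
For completeness, the quadratic step in the proof can be spelled out as
follows. This is only a sketch, and the abbreviation $x$ is introduced
here for illustration. Let $x:=\ell_{1:T}-s_{1:T}^i$. If $x\leq 0$ there is
nothing to show. Otherwise, squaring
$x\leq\sqrt{2\bar\kappa^i(\ell_{1:T}+1)}$ and inserting
$\ell_{1:T}=x+s_{1:T}^i$ gives
\beqn
  x^2-2\bar\kappa^i x-2\bar\kappa^i(s_{1:T}^i+1) \;\leq\; 0,
  \qmbox{hence}
  x \;\leq\; \bar\kappa^i+\sqrt{(\bar\kappa^i)^2+2\bar\kappa^i(s_{1:T}^i+1)},
\eeqn
since $x$ cannot exceed the larger root of the quadratic. The final
estimate in the proof then uses $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ for
$a,b\geq 0$.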
956: 
957: The proofs of results similar to $(ii)$ for WM for 0/1 loss
958: all fill several pages \citep{Auer:02pea,Yaroshinsky:04}. The
959: next result establishes a similar bound, but instead of using
960: the \emph{expected} value $\ell\ltt$, the \emph{best loss so
961: far}
962: $s\ltt\smin$ is used. This may have computational advantages,
963: since $s\ltt\smin$ is immediately available, while $\ell\ltt$
964: needs to be evaluated (see discussion in Section~\ref{secMisc}).
965: 
966: \begin{theorem}[FPL bound for adaptive $\eta_t\propto 1/\sqrt{s\ltt\smin}$]\label{thFPL2}
967: Assume (\ref{eq:Assumptions}) holds.
968: \bqan
969:   i) & & \nq
970:   \mbox{For}\quad \eta_t = 1/\min_i\{k^i+\sqrt{(k^i)^2+2s^i\ltt+2}\}
971:   \qmbox{we have}
972: \\
973:    & & \nq
974:   \ell\leqT \;\leq\; s\leqT^i+(k^i\!+2)\sqrt{2s\leqT^i}+2(k^i\!+2)^2
975:   \quad \forall i
976: \\
977:   ii) & & \nq
978:   \mbox{For}\quad \eta_t =
979:   \sqrt{\odt}\!\cdot\!\min\{1,\sqrt{K/s\ltt\smin}\}
980:   \qmbox{and} k^i\leq K \;\forall i
981:   \qmbox{we have}
982: \\
983:     & & \nq
984:   \ell\leqT \;\leq\;
985:   s\leqT^i+2\sqrt{2K s\leqT^i}+5K\ln(s\leqT^i)+3K+6
986:   \quad \forall i
987: \eqan
988: \end{theorem}
989: %
990: We briefly motivate the strange-looking choice for $\eta_t$ in
991: $(i)$. The first naive candidate, $\eta_t\propto 1/\sqrt{s\ltt^{min}}$,
992: turns out to be too large. The next natural attempt is to require
993: $\eta_t=1/\sqrt{2\min\{s\ltt^i+\frac{k^i}{\eta_t}\}}$. Solving
994: this equation results in $\eta_t=1/(k^i+\sqrt{(k^i)^2+2s\ltt^i})$,
995: where $i$ is the index for which $s\ltt^i+\frac{k^i}{\eta_t}$ is
996: minimal.
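
The solution of this fixed-point equation can be checked as follows. This
is a sketch, and the abbreviation $x:=1/\eta_t$ is introduced here only for
illustration. For the minimizing index $i$, the requirement reads
$x^2=2(s\ltt^i+k^i x)$, i.e.\
\beqn
  x^2-2k^i x-2s\ltt^i \;=\; 0
  \qmbox{with positive root}
  x \;=\; k^i+\sqrt{(k^i)^2+2s\ltt^i}.
\eeqn
Replacing $s\ltt^i$ by $s\ltt^i+1$, which upper-bounds $s\leqt^i$ since
$s_t^i\leq 1$, turns this into $x=k^i+\sqrt{(k^i)^2+2s\ltt^i+2}$, the
expression that is minimized over $i$ in $(i)$.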
997: 
998: \paradot{Proof}
999: Define the minimum of a vector as its minimum component, e.g.\
1000: $\min(k)=k\smin$.
1001: For notational convenience, let
1002: $\eta_0=\infty$ and $\tilde s\leqt=s\leqt+\frac{k-q}{\eta_t}$.
1003: As in the proof of Theorem~\ref{thIFPL}, we consider one
1004: exponentially distributed perturbation $q$. Since $M(\tilde
1005: s\leqt)\scp \tilde s_t \leq M(\tilde s\leqt)\scp \tilde s\leqt-
1006: M(\tilde s\ltt)\scp \tilde s\ltt$ by (\ref{eq:noregret1}), we have
1007: \beqn
1008: M(\tilde s\leqt)\scp s_t\leq M(\tilde s\leqt)\scp
1009: \tilde s\leqt- M(\tilde s\ltt)\scp \tilde s\ltt -
1010: M(\tilde s\leqt)\scp
1011: \left(\frac{k-q}{\eta_{t}}-\frac{k-q}{\eta_{t-1}}\right)
1012: \eeqn
1013: Since $\eta_t\leq\sqrt{^1\!/_2}$, Theorem~\ref{thFIFPL} asserts
1014: $\ell_t\leq E[(1+\eta_t+\eta_t^2)M(\tilde s\leqt)\scp s_t]$, thus
1015: $\ell\leqT\leq A+B$, where
1016: \bqan
1017: A & = & \sum_{t=1}^T E\left[(1+\eta_t+\eta_t^2)(M(\tilde
1018: s\leqt)\scp
1019: \tilde s\leqt- M(\tilde s\ltt)\scp \tilde s\ltt)\right]\\
1020: & = &
1021: E[(1+\eta_T+\eta_T^2)M(\tilde s\leqT) \scp\tilde s\leqT]
1022: - E[(1+\eta_1+\eta_1^2)\min(\frac{k-q}{\eta_1})]\\
1023: && + \sum_{t=1}^{T-1}E\left[
1024: (\eta_t-\eta_{t+1}+\eta_t^2-\eta_{t+1}^2)M(\tilde
1025: s\leqt)\scp\tilde s\leqt\right]
1026: \mbox{\quad and}\\
1027: B & = & \sum_{t=1}^T E\left[(1+\eta_t+\eta_t^2) M(\tilde
1028: s\leqt)\scp
1029: \left(\frac{q-k}{\eta_{t}}-\frac{q-k}{\eta_{t-1}}\right)\right]\\
1030: & \leq & \sum_{t=1}^T (1+\eta_t+\eta_t^2)
1031: \left(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}\right)
1032: =\frac{1+\eta_T+\eta_T^2}{\eta_T}+
1033: \sum_{t=1}^{T-1}\frac{\eta_t-\eta_{t+1}+\eta_t^2-\eta_{t+1}^2}{\eta_t}
1034: \eqan
1035: Here, the estimate for $B$ follows from
1036: $\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\geq 0$ and
1037: $E [M(\eta_t s\leqt+k-q)\scp(q-k)]\leq E[\max_i\{q^i-k^i\}]\leq 1$, which
1038: in turn holds by minimality of $M$, $\sum_i \e^{-k^i}\leq 1$ and
1039: Lemma~\ref{lemExpMax}. In order to estimate $A$, we set
1040: $\bar s\leqt=s\leqt+\frac{k}{\eta_t}$ and observe
1041: $M(\tilde s\leqt)\scp\tilde s\leqt\leq
1042: M(\bar s\leqt)\scp(\bar s\leqt-\frac{q}{\eta_t})$ by minimality
1043: of $M$. The expectations of $q$ can then be evaluated to
1044: $E[M(\bar s\leqt)\scp q]=1$, and as before we have $E[-\min(k-q)]\leq 1$.
1045: Hence
1046: \bqa
1047: \nonumber
1048: \ell\leqT & \leq & A+B\ \leq\
1049: (1+\eta_T+\eta_T^2)\left(M(\bar s\leqT)\scp\bar
1050: s\leqT-\frac{1}{\eta_T}\right)
1051: + \frac{1+\eta_1+\eta_1^2}{\eta_1}\\
1052: \label{eq:basicest}
1053: && + \sum_{t=1}^{T-1}
1054: (\eta_t-\eta_{t+1}+\eta_t^2-\eta_{t+1}^2)\left(M(\bar
1055: s\leqt)\scp\bar s\leqt-\frac{1}{\eta_t}\right)+B\\
1056: \nonumber
1057: & \leq &
1058: (1+\eta_T+\eta_T^2)\min(\bar s\leqT)+
1059: \sum_{t=1}^{T-1} (\eta_t-\eta_{t+1}+\eta_t^2-\eta_{t+1}^2)
1060: \min(\bar s\leqt)+\frac{1}{\eta_1}+2.
1061: \eqa
1062: We now proceed by considering the two parts of the theorem
1063: separately.
1064: 
1065: $(i)$
1066: Here,
1067: $\eta_t=1/\min(k+\sqrt{k^2+2s\ltt+2})$. Fix $t\leq T$ and
1068: choose $m$ such that
1069: $k^m+\sqrt{(k^m)^2+2s\ltt^m+2}$ is minimal. Then
1070: \beqn \min(s\leqt+\frac{k}{\eta_t})
1071: \leq s\ltt^m+1+\frac{k^m}{\eta_t}
1072: =
1073: \mbox{$\frac{1}{2}$}\big(k^m+\sqrt{(k^m)^2+2s\ltt^m+2}\big)^2=\frac{1}{2\eta_t^2}
1074: \leq\frac{1}{2\eta_t\eta_{t+1}}.
1075: \eeqn
1076: We may overestimate the quadratic terms $\eta_t^2$ in
1077: (\ref{eq:basicest}) by $\eta_t$ -- the easiest justification
1078: is that we could have started with the cruder estimate
1079: $\ell_t\leq(1+2\eta_t)r_t$ from Theorem~\ref{thFIFPL}. Then
1080: \bqan
1081: \ell\leqT & \leq &
1082: (1+2\eta_T)\min(s\leqT+\frac{k}{\eta_T})+ 2\sum_{t=1}^{T-1}
1083: (\eta_t-\eta_{t+1})\min(s\leqt+\frac{k}{\eta_t})+\frac{1}{\eta_1}+2\\
1084: & \leq &
1085: (1+2\eta_T)\frac{1}{2\eta_T^2}+ 2\sum_{t=1}^{T-1}
1086: (\eta_t-\eta_{t+1})\frac{1}{2\eta_t^2}+\frac{1}{\eta_1}+2\\
1087: & \leq & \frac{1}{2\eta_T^2}+\frac{1}{\eta_T}+
1088: \sum_{t=1}^{T-1}\left(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_t}\right)+\frac{1}{\eta_1}+2\\
1089: & \leq &
1090: \mbox{$\frac{1}{2}$}\min(k+\sqrt{k^2+2s\ltT+2})^2+2\min(k+\sqrt{k^2+2s\ltT+2}) +2\\
1091: & \leq &
1092: s\leqT^i+(k^i+2)\sqrt{2s\leqT^i}+2(k^i)^2+6k^i+6
1093: \quad\mbox{for all}\ i.
1094: \eqan
1095: This proves the first part of the theorem.
1096: 
1097: $(ii)$ Here we have $K\geq k^i$ for all $i$. Abbreviate
1098: $a_t=\max\{K,s\leqt\smin\}$ for $0\leq t\leq T$, then
1099: $\eta_t=\sqrt{\frac{K}{2a_{t-1}}}$,
1100: $a_t\geq K$, and $a_t-a_{t-1}\leq 1$ for all $t$. Observe
1101: $M(\bar s\leqt)=M(s\leqt)$,
1102: $\eta_t-\eta_{t+1}=\frac{\sqrt K(a_t-a_{t-1})}
1103: {\sqrt 2\sqrt{a_t}\sqrt{a_{t-1}} (\sqrt{a_t}+\sqrt{a_{t-1}})}$,
1104: $\eta_t^2-\eta_{t+1}^2=\frac{K(a_t-a_{t-1})}{2a_ta_{t-1}}$, and
1105: $\frac{a_t-a_{t-1}} {2 a_{t-1}}\leq
1106: \ln(1+\frac{a_t-a_{t-1}}{a_{t-1}})=\ln (a_t)-\ln (a_{t-1})$ which is true for
1107: $\frac{a_t-a_{t-1}}{a_{t-1}}\leq\frac{1}{K}\leq\frac{1}{\ln 2}$. This
1108: implies
1109: \bqan
1110: \frac{(\eta_t-\eta_{t+1})K}{\eta_t} & \leq &
1111: \frac{K(a_t-a_{t-1})}
1112: {2 a_{t-1}}\leq K\ln\left(1+\frac{a_t-a_{t-1}}{a_{t-1}}\right)
1113: = K\big(\ln(a_t)-\ln(a_{t-1})\big), \\
1114: (\eta_t-\eta_{t+1})s\leqt\smin & \leq &
1115: \frac{\sqrt K(a_t-a_{t-1})(\sqrt{a_{t-1}}+
1116: \sqrt{a_t}-\sqrt{a_{t-1}})}
1117: {\sqrt 2\sqrt{a_{t-1}} (\sqrt{a_t}+\sqrt{a_{t-1}})}\\
1118: & = & \sqrt{\frac{K}{2}}(\sqrt{a_t}-\sqrt{a_{t-1}})
1119: +\frac{\sqrt K(a_t-a_{t-1})^2}{\sqrt{2a_{t-1}}(\sqrt{a_t}+\sqrt{a_{t-1}})^2}\\
1120: & \leqs
1121: {\hspace*{-5cm}\rlap{\fbox{$\stackrel{\mbox{\tiny use
1122: }\scriptstyle a_t-a_{t-1}\leq 1} {\mbox{\tiny and }
1123: \scriptstyle a_{t-1}\geq K}$}}}
1124: & \sqrt{\frac{K}{2}}(\sqrt{a_t}-\sqrt{a_{t-1}})+
1125: \frac{1}{2\sqrt 2}\big(\ln(a_t)-\ln(a_{t-1})\big),\\
1126: \frac{(\eta_t^2-\eta_{t+1}^2)K}{\eta_t} & = &
1127: \frac{K\sqrt{K}(a_t-a_{t-1})}{\sqrt{2}a_t\sqrt{a_{t-1}}}
1128: \leqs {\fbox{$\scriptstyle a_{t-1}\geq K$}}
1129: \sqrt{2}K\big(\ln(a_t)-\ln(a_{t-1})\big), \mbox{ and}\\
1130: (\eta_t^2-\eta_{t+1}^2)s\leqt\smin & \leq &
1131: \frac{K(a_t-a_{t-1})}{2a_{t-1}} \leq
1132: K\big(\ln(a_t)-\ln(a_{t-1})\big),
1133: \eqan
1134: The logarithmic estimate in the second and
1135: third bound is unnecessarily rough and for convenience only.
1136: Therefore, the coefficient of the log-term in the final bound of
1137: the theorem can be reduced to
1138: $2K$ without much effort. Plugging the above estimates back into
1139: (\ref{eq:basicest}) yields
1140: \bqan
1141: \ell\leqT & \leq & s\leqT\smin+\sqrt{\frac{K}{2} s\leqT\smin}+\sqrt{2K s\leqT\smin}+3K+2
1142: +\sqrt{\frac{K}{2} s\leqT\smin}+
1143: \big(\mbox{$\frac{7}{2}$}K+\mbox{$\frac{1}{2\sqrt 2}$}\big)\ln(s\leqT\smin)\\
1144: &&+\frac{1}{\eta_1}+2
1145: \leq s\leqT\smin+2\sqrt{2K
1146: s\leqT\smin}+5K\ln(s\leqT\smin)+3K+6.
1147: \eqan
1148: This completes the proof.
1149: \qed
1150: 
1151: Theorem~\ref{thFPLLDynamic} and Theorem~\ref{thFPL2} $(i)$
1152: immediately imply the following bounds on the
1153: $\sqrt{\mbox{Loss}}$-regrets:
1154: $\sqrt{\ell_{1:T}}\leq\sqrt{s_{1:T}^i+1}+\sqrt{8K}$,
1155: $\sqrt{\ell_{1:T}}\leq\sqrt{s_{1:T}^i+1}+\sqrt{2}(k^i+1)$, and
1156: $\sqrt{\ell_{1:T}}\leq\sqrt{s_{1:T}^i}+\sqrt{2}(k^i+2)$,
1157: respectively.
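
As an illustration of how these bounds arise (a sketch for the first one),
Theorem~\ref{thFPLLDynamic}$(ii)$ and the identity
$2\sqrt{2(s_{1:T}^i+1)K}=\sqrt{s_{1:T}^i+1}\,\sqrt{8K}$ give
\beqn
  \ell_{1:T}
  \;\leq\; s_{1:T}^i+\sqrt{s_{1:T}^i+1}\,\sqrt{8K}+8K
  \;\leq\; (s_{1:T}^i+1)+2\sqrt{s_{1:T}^i+1}\,\sqrt{8K}+8K
  \;=\; \big(\sqrt{s_{1:T}^i+1}+\sqrt{8K}\big)^2.
\eeqn
Taking square roots yields the first bound. The other two follow in the
same way by completing the square.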
1158: 
1159: \paradot{Remark}
1160: The same analysis as for Theorems
1161: [\ref{thFPLStatic}--\ref{thFPL2}]$(ii)$ applies to general $\D$,
1162: using $\ell_t\leq \e^{\eta_t n}r_t$ instead of $\ell_t\leq
1163: \e^{\eta_t}r_t$, and leading to an additional factor $\sqrt{n}$ in
1164: the regret. Compare the remark at the end of Section~\ref{secFFPL}.
1165: 
1166: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1167: \section{Hierarchy of Experts}\label{secHierarchy}
1168: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1169: 
1170: We derived bounds which do not need prior knowledge of $L$ with
1171: regret $\propto\sqrt{TK}$ and $\propto\sqrt{s_{1:T}^i K}$ for a
1172: finite number of experts with equal penalty $K=k^i=\ln n$. For
1173: an infinite number of experts, unbounded expert-dependent complexity
1174: penalties $k^i$ are necessary (due to constraint $\sum_i
1175: \e^{-k^i}\leq 1$). Bounds for this case (without prior knowledge of
1176: $T$) with regret $\propto k^i\sqrt{T}$ and $\propto
1177: k^i\sqrt{s_{1:T}^i}$ have been derived. In this case, the
1178: complexity $k^i$ is no longer under the square root. Although
1179: this already implies Hannan consistency, i.e.\ the average
1180: per-round regret tends to zero as $t\to\infty$, improved regret
1181: bounds $\propto\sqrt{Tk^i}$ and $\propto\sqrt{s_{1:T}^i k^i}$
1182: are desirable and likely to hold. We were not able to derive
1183: such improved bounds for FPL itself, but only for a (slight) modification.
1184: We consider a two-level hierarchy of experts. First consider
1185: an FPL for the subclass of experts of complexity
1186: $K$, for each $K\in\SetN$. Regard these FPL$^K$ as (meta) experts
1187: and use them to form a (meta) FPL. The class of meta experts now
1188: contains for each complexity only one (meta) expert, which allows
1189: us to derive good bounds. In the following, quantities referring
1190: to complexity class $K$ are superscripted by $K$, and meta
1191: quantities are superscripted by $\;\widetilde{}$ .
1192: 
1193: Consider the class of experts $\E^K:=\{i:K-1<k^i\leq K\}$ of
1194: complexity $K$, for each $K\in\SetN$. FPL$^K$ makes randomized
1195: prediction
1196: $I_t^K:=\arg\min_{i\in\E^K}\{s_{<t}^i+\smash{k^i-q^i\over\eta_t^K}\}$
1197: with $\eta_t^K:=\sqrt{K/2t}$ and suffers loss $u_t^K:=s_t^{I_t^K}$
1198: at time $t$. Since $k^i\leq K$ $\forall i\in\E^K$ we can apply
1199: Theorem~\ref{thFPLTDynamic}$(ii)$ to FPL$^K$:
1200: \beq\label{eqFH}
1201:   E[u_{1:T}^K] \;=\; \ell_{1:T}^K \;\leq\; s_{1:T}^i+ 2\sqrt{2TK}
1202:   \quad \forall i\in\E^K
1203:   \quad \forall K\in\SetN.
1204: \eeq
1205: We now define a meta state $\tilde s_t^K=u_t^K$ and regard FPL$^K$
1206: for $K\in\SetN$ as meta experts, so meta expert $K$ suffers loss
1207: $\tilde s_t^K$. (Assigning expected loss $\tilde
1208: s_t^K=E[u_t^K]=\ell_t^K$ to FPL$^K$ would also work.) Hence the
1209: setting is again an expert setting and we define the meta
1210: $\widetilde{\mbox{FPL}}$ to predict $\tilde
1211: I_t:=\arg\min_{K\in\SetN}\{\tilde s_{<t}^K+{\tilde k^K-\tilde
1212: q^K\over \tilde\eta_t}\}$ with $\tilde\eta_t=1/\sqrt{t}$ and
1213: $\tilde k^K=\odt+2\ln K$ (implying $\sum_{K=1}^\infty \e^{-\tilde
1214: k^K}\leq 1$). Note that $\tilde s_{1:t}^K=\tilde s_1^K+...+\tilde
1215: s_t^K= s_1^{I_1^K}+...+s_t^{I_t^K}$ sums over the same meta state
1216: components $K$, but over different components ${I_t^K}$ in normal
1217: state representation.
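
That the meta complexities satisfy the required constraint can be checked
directly. The following short verification sketch uses the standard
identity $\sum_{K\geq 1}K^{-2}=\pi^2/6$:
\beqn
  \sum_{K=1}^\infty \e^{-\tilde k^K}
  \;=\; \e^{-1/2}\sum_{K=1}^\infty {1\over K^2}
  \;=\; \e^{-1/2}\,{\pi^2\over 6}
  \;\approx\; 0.998
  \;\leq\; 1.
\eeqn
The smallest additive constant compatible with the weights $2\ln K$ is
$\ln(\pi^2/6)\approx 0.498$, so the constant $\odt$ in $\tilde k^K$ is
close to the smallest possible one.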
1218: 
1219: By Theorem~\ref{thFPLTDynamic}$(i)$ the $\tilde q$-expected loss
1220: of $\widetilde{\mbox{FPL}}$ is bounded by $\tilde s_{1:T}^K +
1221: \sqrt{T}(\tilde k^K+2)$. As this bound holds for all $q$ it also holds
1222: in $q$-expectation. So if we define $\tilde\ell_{1:T}$ to be the
1223: $q$ {\em and} $\tilde q$ expected loss of
1224: $\widetilde{\mbox{FPL}}$, and chain this bound with (\ref{eqFH})
1225: for $i\in\E^K$ we get:
1226: \bqan
1227:   \tilde\ell_{1:T}
1228:   &\leq& E[\tilde s_{1:T}^K + \sqrt{T}(\tilde k^K\!+2)]
1229:    \;=\; \ell_{1:T}^K + \sqrt{T}(\tilde k^K\!+2) \\
1230:   &\leq& s_{1:T}^i+ \sqrt{T}[2\sqrt{2(k^i\!+1)}+\odt+2\ln (k^i\!+1)+2],
1231: \eqan
1232: where we have used $K\leq k^i+1$. This bound is valid for all $i$
1233: and has the desired regret $\propto\sqrt{T k^i}$. Similarly we can
1234: derive regret bounds $\propto\sqrt{s_{1:T}^i k^i}$ by exploiting
1235: that the bounds in Theorems~\ref{thFPLLDynamic} and \ref{thFPL2}
1236: are concave in $s_{1:T}^i$ and using Jensen's inequality.
1237: 
1238: \begin{theorem}[Hierarchical FPL bound for dynamic $\eta_t$]\label{thHFPL}
1239: The hierarchical $\widetilde{\mbox{FPL}}$ employs at time $t$
1240: the prediction of expert $i_t:=I_t^{\tilde I_t}$, where
1241: \vspace{-0.5ex}\beqn
1242:   I_t^K:=\mathop{\arg\min}_{i:\lceil k^i\rceil=K}
1243:     \Big\{s_{<t}^i+{\textstyle{k^i-q^i\over\eta_t^K}}\Big\}
1244:   \qmbox{and}
1245:   \tilde I_t:=\mathop{\arg\min}_{K\in\SetN}
1246:     \Big\{s_1^{I_1^K}+...+s_{t-1}^{I_{t-1}^K}+
1247:     {\textstyle{{1\over 2}+2\ln\!K -\tilde q^K\over \tilde\eta_t}}\Big\}
1248:   \vspace{-1.5ex}
1249: \eeqn
1250: Under assumptions (\ref{eq:Assumptions}) and independent $P[\tilde q^K]=\e^{-\tilde
1251: q^K}$ $\forall K\in\SetN$, the
1252: expected loss $\tilde\ell_{1:T}=E[s_1^{i_1}+...+s_T^{i_T}]$ of
1253: $\widetilde{\mbox{FPL}}$ is bounded as follows:
1254: \bqan
1255:   a) & & \nq
1256:   \mbox{For}\quad \eta_t^K=\sqrt{K/2t}
1257:   \qmbox{and} \tilde\eta_t=1/\sqrt{t}
1258:   \qmbox{we have}
1259: \\
1260:    & & \nq
1261:   \tilde\ell_{1:T}
1262:   \;\leq\; s_{1:T}^i + 2\sqrt{2Tk^i}\!\cdot\!\big(1+O({\textstyle{\ln k^i\over \sqrt{k^i}}})\big)
1263:   \quad\forall i.
1264: \\
1265:   b) & & \nq
1266:   \mbox{For $\tilde\eta_t$ as in $(i)$ and $\eta_t^K$ as in $(ii)$
1267:   of Theorem $\{{\ref{thFPLLDynamic}\atop\ref{thFPL2}}\}$ we have}
1268: \\
1269:     & & \nq
1270:   \tilde\ell_{1:T}
1271:   \;\leq\; s_{1:T}^i + 2\sqrt{2s_{1:T}^i k^i}\!\cdot\!\big(1+O({\textstyle{\ln k^i\over \sqrt{k^i}}})\big)
1272:   + {\textstyle\big\{{O(k^i)\atop O(k^i\ln s_{1:T}^i)}\big\}}
1273:   \quad\forall i.
1274: \eqan
1275: \end{theorem}
1276: %
1277: The hierarchical $\widetilde{\mbox{FPL}}$ differs from a
1278: direct FPL over all experts $\E$. One potential way to prove a
1279: bound on direct FPL may be to show (if it holds) that FPL
1280: performs better than $\widetilde{\mbox{FPL}}$, i.e.\ $\ell_{1:T}\leq
1281: \tilde\ell_{1:T}$. Another way may be to suitably generalize
1282: Theorem~\ref{thFIFPL} to expert-dependent $\eta$.
1283: 
1284: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1285: \section{Lower Bound on FPL}\label{secLowFPL}
1286: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1287: 
1288: A lower bound on FPL similar to the upper bound in Theorem
1289: \ref{thIFPL} can also be proven.
1290: 
1291: \begin{theorem}[FPL lower-bounded by BEH]\label{thLowFPL}
1292: Let $n$ be finite. Assume $\D\subseteq\SetR^n$ and $s_t\in\SetR^n$
1293: are chosen such that the required extrema exist (possibly
1294: negative), $q\in\SetR^n$, and $\eta_t>0$ is a
1295: decreasing sequence. Then the loss of FPL for uniform
1296: complexities (l.h.s.) can be lower-bounded in terms of the best
1297: predictor in hindsight (first term on r.h.s.) plus/minus additive
1298: corrections:
1299: \beqn
1300:   \sum_{t=1}^T M(s_{<t}-{q\over\eta_t})\scp s_t
1301:   \geq \min_{d\in\D}\{d\scp s_{1:T}\}
1302:      - {1\over\eta_T}\max_{d\in\D}\{d\scp q\}
1303:      + \sum_{t=1}^T ({1\over\eta_t}\!-\!{1\over\eta_{t-1}}) M(s_{<t})\scp q
1304: \eeqn
1305: \end{theorem}
1306: 
1307: \paradot{Proof}
1308: For notational convenience, let $\eta_0=\infty$ and
1309: $\tilde s\leqt=s\leqt-\frac{q}{\eta_t}$. Consider the losses
1310: $\tilde s_t=s_t-q\big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\big)$
1311: for the moment. We first show by induction on $T$ that the
1312: predictor $M(\tilde s_{<t})$ has nonnegative regret, i.e.\
1313: \beq\label{eqposregret}
1314:   \sum_{t=1}^T M(\tilde s_{<t})\scp\tilde s_t \geq M(\tilde s_{1:T})\scp\tilde s_{1:T}.
1315: \eeq
1316: For $T=1$ this follows immediately from minimality of $M$
1317: ($\tilde s_{<1}:=0$). For the induction step from $T-1$ to $T$ we
1318: need to show
1319: \beqn
1320:   M(\tilde s_{<T})\scp \tilde s_T \geq M(\tilde s_{1:T})\scp \tilde s_{1:T} -
1321:   M(\tilde s_{<T})\scp \tilde s_{<T}.
1322: \eeqn
1323: Due to $\tilde s_{1:T}=\tilde s_{<T}+\tilde s_T$, this is
1324: equivalent to $M(\tilde s_{<T})\scp \tilde s_{1:T} \geq M(\tilde
1325: s_{1:T})\scp \tilde s_{1:T}$, which holds by minimality of
1326: $M$. Rearranging terms in (\ref{eqposregret}) we obtain
1327: \beq\label{eqifpl2l}
1328:   \sum_{t=1}^T M(\tilde s_{<t})\scp s_t
1329:   \geq M(\tilde s_{1:T})\scp \tilde s\leqT
1330:    + \sum_{t=1}^T M(\tilde s_{<t})\scp q
1331:    \Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Big), \quad\mbox{with}
1332: \eeq
1333: \beqn
1334:   M(\tilde s_{1:T})\scp \tilde s\leqT=
1335:   M(s_{1:T}-\frac{q}{\eta_T})\scp s_{1:T}
1336:   -M(s_{1:T}-\frac{q}{\eta_T})\scp \frac{q}{\eta_T}
1337:   \geq \min_{d\in\D}\{d\scp s_{1:T}\}
1338:   - {1\over\eta_T}\max_{d\in\D}\{d\scp q\}
1339: \eeqn
1340: \beqn
1341: \mbox{and}\quad \sum_{t=1}^T M(\tilde s_{<t})\scp q
1342: \Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)
1343: \;\geq\; \sum_{t=1}^T \Big({1\over\eta_t}-{1\over\eta_{t-1}}\Big)M(s_{<t})\scp q
1344: \eeqn
1345: Again, the last bound follows from the minimality of $M$, which
1346: asserts that $[M(s-q)-M(s)]\scp s\geq 0\geq
1347: [M(s-q)-M(s)]\scp(s-q)$ and thus implies that $M(s-q)\scp q\geq
1348: M(s)\scp q$. So Theorem \ref{thLowFPL} follows from (\ref{eqifpl2l}).
1349: \qed
1350: 
1351: Assuming $q$ random with $E[q^i]=1$ and taking the expectation in
1352: Theorem~\ref{thLowFPL}, the last term reduces to
1353: $\sum_t({1\over\eta_t}-{1\over\eta_{t-1}})\sum_i
1354: M(s_{<t})^i$.
1355: If $\D\geq 0$, the term is positive and may be dropped. In case of
1356: $\D=\E$ or $\Delta$, the last term is identical to
1357: ${1\over\eta_T}$ (since $\sum_i d^i=1$) and keeping it improves
1358: the bound.
1359: %
1360: Furthermore, we need to evaluate the expectation of the second to
1361: last term in Theorem~\ref{thLowFPL}, namely
1362: $E[\max_{d\in\D}\{d\scp q\}]$. For $\D=\E$ and $q$ being
1363: exponentially distributed, using Lemma~\ref{lemExpMax} with
1364: $k^i=0$ $\forall i$, the expectation is bounded by $1+\ln n$.
1365: We hence get the following lower bound:
1366: 
1367: \begin{corollary}[FPL lower-bounded by BEH]\label{corLowFPL}
For $\D=\E$, any $\S$, all $k^i$ equal,
$P[q^i]=\e^{-q^i}$ for $q\geq 0$, and decreasing $\eta_t>0$, the
expected loss of FPL is at most
$\ln n/\eta_T$ lower than the loss of the best expert in hindsight:
1372: \beqn
1373:   \ell_{1:T} \;\geq\; s_{1:T}^{min} - {\ln n\over\eta_T}
1374: \eeqn
1375: \end{corollary}
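
To see how the constants in Corollary~\ref{corLowFPL} combine, the
following sketch takes the expectation of the three terms on the
right-hand side of (\ref{eqifpl2l}) for $\D=\E$. It assumes the
convention $1/\eta_0:=0$, which is implicit in the statement above
that the last term equals $1/\eta_T$, so that the sum over $t$
telescopes:
\beqn
  \ell_{1:T} \;\geq\; s_{1:T}^{min} - {1\over\eta_T}E[\max_i q^i]
  + \sum_{t=1}^T\Big({1\over\eta_t}-{1\over\eta_{t-1}}\Big)
  \;\geq\; s_{1:T}^{min} - {1+\ln n\over\eta_T} + {1\over\eta_T}
  \;=\; s_{1:T}^{min} - {\ln n\over\eta_T}.
\eeqn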
1376: 
1377: The upper and lower bounds on $\ell_{1:T}$
1378: (Theorem~\ref{thFIFPL} and Corollaries~\ref{corIFPL} and
1379: \ref{corLowFPL}) together show that
1380: \beq\label{eqltos}
1381:   {\ell_{1:t}\over s_{1:t}^{min}} \to 1
1382:   \quad\qmbox{if}\quad
1383:   \eta_t\to 0
1384:   \qmbox{and}
1385:   \eta_t\!\cdot\!s_{1:t}^{min} \to \infty
1386:   \qmbox{and}
1387:   k^i=K\;\forall i
1388: \eeq
1389: For instance, $\eta_t=\sqrt{K/2 s_{<t}^{min}}$. For
1390: $\eta_t=\sqrt{K/2(\ell_{<t}+1)}$ we proved the bound in Theorem
1391: \ref{thFPLLDynamic}$(ii)$. Knowing that $\sqrt{K/2(\ell_{<t}+1)}$
1392: converges to $\sqrt{K/2 s_{<t}^{min}}$ due to (\ref{eqltos}), we
1393: can derive a bound similar to Theorem~\ref{thFPLLDynamic}$(ii)$
1394: for $\eta_t=\sqrt{K/2 s_{<t}^{min}}$. This choice for $\eta_t$ has
1395: the advantage that we do not have to compute $\ell_{<t}$ (cf.\
1396: Section~\ref{secComp}), as also achieved by Theorem~\ref{thFPL2}$(ii)$.
1397: 
1398: We do not know whether Theorem~\ref{thLowFPL} can be
1399: generalized to expert dependent complexities $k^i$.
1400: 
1401: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1402: \section{Adaptive Adversary}\label{secAdap}
1403: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1404: 
1405: In this section we show that bounds that hold against an
1406: oblivious adversary automatically also hold against an
1407: adaptive one.
1408: 
1409: %-------------------------------%
1410: \paradot{Initial versus independent randomization}
1411: %-------------------------------%
1412: So far we assumed that the perturbations $q$ are sampled only once at
1413: time $t=0$. As already indicated, under the expectation this is
1414: equivalent to generating a new perturbation $q_t$ at each time
1415: step $t$, i.e.\ Theorems \ref{thFIFPL}--\ref{thHFPL} remain valid
1416: for this case. While the former choice was favorable for the
1417: analysis, the latter has two advantages.
1418: %
1419: First, repeated sampling of the perturbations guarantees better
1420: bounds with high probability (see next section).
1421: %
1422: Second, if the losses are generated by an adaptive adversary (not
1423: to be confused with an adaptive learning rate) which has access to
1424: FPL's past decisions, then he may after some time figure out the
1425: initial random perturbation and use it to force FPL to have a
1426: large loss.
1427: %
1428: We now show that the bounds for FPL remain valid, even in case of
1429: an adaptive adversary, if independent randomization $q\leadsto
1430: q_t$ is used.
1431: 
1432: %-------------------------------%
1433: \paradot{Oblivious versus adaptive adversary}
1434: %-------------------------------%
1435: Recall the protocol for FPL: After each expert $i$ made its
1436: prediction $y_t^i$, and FPL combined them to form its own prediction
1437: $y_t^{\FPL}$, we observe $x_t$, and Loss($x_t,y_t^{\cdots}$) is
1438: revealed for FPL's and each expert's prediction. For independent
1439: randomization, we have $y_t^{\FPL}=y_t^{\FPL}(x_{<t},y_{1:t},q_t)$. For an
1440: oblivious (non-adaptive) adversary, $x_t=x_t(x_{<t},y_{<t})$.
1441: Recursively inserting and eliminating the experts
1442: $y_t^i=y_t^i(x_{<t},y_{<t})$ and $y_t^{\FPL}$, we get the dependencies
1443: \beq\label{eqnAdapDep}
1444:   u_t :=\mbox{Loss}(x_t,y_t^{\FPL}) = u_t(x_{1:t},q_t)
1445:   \qmbox{and}
1446:   s_t^i := \mbox{Loss}(x_t,y_t^i) = s_t^i(x_{1:t}),
1447: \eeq
1448: where $x_{1:t}$ is a ``fixed'' sequence.
1449: With this notation, Theorems \ref{thFPLStatic}--\ref{thFPL2} read
1450: $\ell_{1:T}\equiv E[\sum_{t=1}^T u_t(x_{1:t},q_t)]\leq f(x_{1:T})$
1451: for all $x_{1:T}\in\X^T$, where $f(x_{1:T})$ is one of the
1452: r.h.s.\ in Theorems \ref{thFPLStatic}--\ref{thFPL2}. Noting that
1453: $f$ is independent of $q_{1:T}$, we can write this as
1454: \beq\label{eqnAdapBnd}\label{defAt}
1455:   A_1\leq 0, \qmbox{where}
1456:   A_t(x_{<t},q_{<t})
1457:   := \max_{x_{t:T}}E_{q_{t:T}}\Big[\sum_{\tau=1}^T u_\tau(x_{1:\tau},q_\tau)-f(x_{1:T})\Big],
1458: \eeq
1459: where $E_{q_{t:T}}$ is the expectation w.r.t.\ $q_t...q_T$
1460: (keeping $q_{<t}$ fixed).
1461: 
1462: For an adaptive adversary, $x_t=x_t(x_{<t},y_{<t},y_{<t}^{\FPL})$
1463: can additionally depend on $y_{<t}^{\FPL}$. Eliminating $y_t^i$
1464: and $y_t^{\FPL}$ we get, again, (\ref{eqnAdapDep}), but
1465: $x_t=x_t(x_{<t},q_{<t})$ is no longer fixed, but an (arbitrary)
1466: random function. So we have to replace $x_t$ by
1467: $x_t(x_{<t},q_{<t})$ in (\ref{eqnAdapBnd}) for $t=1..T$. The
1468: maximization is now a functional maximization over all functions
1469: $x_t(\cdot,\cdot)...x_T(\cdot,\cdot)$. Using ``$\max_{x(\cdot)}E_q
1470: [g(x(q),q)]=E_q\max_x[g(x,q)]$,$\!$'' we can write this as
1471: \beq\label{defBt}
1472:   B_1\stackrel?\leq 0, \qmbox{where}
1473:   B_t(x_{<t},q_{<t})
  := \max_{x_t}E_{q_t}...\max_{x_T}E_{q_T}\Big[\sum_{\tau=1}^T u_\tau(x_{1:\tau},q_\tau)-f(x_{1:T})\Big].
1475: \eeq
1476: So, establishing $B_1\leq 0$ would show that all bounds
1477: also hold in the adaptive case.
1478: 
1479: \begin{lemma}[Adaptive=Oblivious]\label{lemAdap}
1480: Let $q_1...q_T\in\SetR^T$ be independent random variables,
1481: $E_{q_t}$ be the expectation w.r.t.\ $q_t$, $f$ any function of
1482: $x_{1:T}\in\X^T$, and $u_t$ arbitrary functions of $x_{1:t}$ and $q_t$.
1483: Then, $A_t(x_{<t},q_{<t})=B_t(x_{<t},q_{<t})$ for all $1\leq t\leq T$, where
1484: $A_t$ and $B_t$ are defined in (\ref{defAt}) and (\ref{defBt}).
1485: In particular, $A_1\leq 0$ implies $B_1\leq 0$.
1486: \end{lemma}
1487: %
\paradot{Proof} We prove $B_t=A_t$ by backward induction on $t$, which
establishes the lemma. $B_T=A_T$ is obvious. Assume $B_t=A_t$.
1490: Then
1491: \bqan
1492:   B_{t-1} &=& \max_{x_{t-1}}E_{q_{t-1}}B_t \;=\; \max_{x_{t-1}}E_{q_{t-1}} A_t
1493: \\
1494:   &=& \max_{x_{t-1}}E_{q_{t-1}}
1495:       \bigg[\max_{x_{t:T}}E_{q_{t:T}}\Big[\sum_{\tau=1}^T u_\tau(x_{1:\tau},q_\tau)-f(x_{1:T})\Big]\bigg]
1496: \\
1497:   &=& \max_{x_{t-1}}E_{q_{t-1}}
      \bigg[\underbrace{\sum_{\tau=1}^{t-1} u_\tau(x_{1:\tau},q_\tau)}_{\hspace*{-3ex}\text{independent of } x_{t:T} \text{ and } q_{t:T}} +
            \underbrace{\max_{x_{t:T}}E_{q_{t:T}}\Big[\sum_{\tau=t}^T u_\tau(x_{1:\tau},q_\tau)-f(x_{1:T})\Big]}_{\text{independent of $q_{t-1}$, since the $q_t$ are independent}} \bigg]
1500: \\
1501:   &=&
  \max_{x_{t-1}} \bigg[\overbrace{E_{q_{t-1}}\Big[\sum_{\tau=1}^{t-1} u_\tau(x_{1:\tau},q_\tau)\Big]} +
  \overbrace{\max_{x_{t:T}}E_{q_{t:T}}\Big[\sum_{\tau=t}^T u_\tau(x_{1:\tau},q_\tau)-f(x_{1:T})\Big]}\bigg]
1504: \\
1505:   &=&
1506:   \max_{x_{t-1}}\max_{x_{t:T}}E_{q_{t:T}} \bigg[E_{q_{t-1}}\sum_{\tau=1}^{t-1} u_\tau(x_{1:\tau},q_\tau) +
1507:   \sum_{\tau=t}^T u_\tau(x_{1:\tau},q_\tau)-f(x_{1:T})\bigg]
1508:   \;\;=\;\; A_{t-1}
1509: \eqan\qed
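
The identity $A_1=B_1$ can be checked by brute force on a tiny
instance. The following is a minimal sketch, not part of the original
analysis: it assumes a finite observation set, a finite support for
each $q_t$, and randomly generated tables for $u_t$ and $f$. All
function names (\texttt{make\_instance}, \texttt{oblivious\_value},
\texttt{adaptive\_value}) are hypothetical and chosen for this
illustration only.

```python
import itertools
import random


def make_instance(T, xs, qs, seed=0):
    """Random tables u[(x_1..x_t, q_t)] and f[x_1..x_T] (purely illustrative)."""
    rng = random.Random(seed)
    u = {(x, q): rng.random()
         for t in range(1, T + 1)
         for x in itertools.product(xs, repeat=t)
         for q in qs}
    f = {x: rng.random() for x in itertools.product(xs, repeat=T)}
    return u, f


def oblivious_value(T, xs, qs, probs, u, f):
    """A_1: max over fixed sequences x_{1:T} of the expectation over q_{1:T}."""
    best = float("-inf")
    for x in itertools.product(xs, repeat=T):
        value = -f[x]
        for t in range(1, T + 1):
            # E over q_t of u_t(x_{1:t}, q_t); q_t has distribution probs on qs
            value += sum(p * u[(x[:t], q)] for q, p in zip(qs, probs))
        best = max(best, value)
    return best


def adaptive_value(T, xs, qs, probs, u, f, x=(), acc=0.0):
    """B_t: alternate max over x_t and expectation over q_t, given the past.

    acc holds the realized losses u_1 + ... + u_{t-1}, which depend on q_{<t}.
    """
    if len(x) == T:
        return acc - f[x]
    best = float("-inf")
    for xt in xs:
        x_new = x + (xt,)
        expected = sum(
            p * adaptive_value(T, xs, qs, probs, u, f, x_new, acc + u[(x_new, q)])
            for q, p in zip(qs, probs))
        best = max(best, expected)
    return best


if __name__ == "__main__":
    xs, qs, probs = (0, 1), (0, 1), (0.3, 0.7)
    u, f = make_instance(3, xs, qs, seed=1)
    print(oblivious_value(3, xs, qs, probs, u, f))
    print(adaptive_value(3, xs, qs, probs, u, f))
```

Up to floating-point rounding, both printed values should agree, as
Lemma~\ref{lemAdap} asserts.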
1510: 
1511: \begin{corollary}[FPL Bounds for adaptive adversary]\label{corAdap}
1512: Theorems \ref{thFPLStatic}--\ref{thFPL2} also hold for an adaptive
1513: adversary in case of independent randomization $q\leadsto q_t$.
1514: \end{corollary}
1515: 
1516: Lemma \ref{lemAdap} shows that every bound of the form
1517: $A_1\leq 0$ proven for an oblivious adversary, implies an
1518: analogous bound
1519: $B_1\leq 0$ for an adaptive adversary. Note that this strong
1520: statement holds only for the \emph{full observation game},
1521: i.e.\ if after each time step we learn all losses. In partial
1522: observation games such as the Bandit case \citep{Auer:95}, our
1523: actual action may depend on our past action by means of our
1524: past observation, and the assertion no longer holds. In this
1525: case, FPL with an adaptive adversary can be analyzed as shown
1526: by \citet{McMahan:04,Poland:05actexp}.
1527: %
1528: Finally, $y_t^{\IFPL}$ can additionally depend on $x_t$, but the
1529: ``reduced'' dependencies (\ref{eqnAdapDep}) are the same as for
FPL; hence IFPL bounds also hold for an adaptive adversary.
1531: 
1532: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1533: \section{Miscellaneous}\label{secMisc}\label{secComp}
1534: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1535: 
1536: %-------------------------------%
1537: \paradot{Bounds with high probability}
1538: %-------------------------------%
1539: We have derived several bounds for the expected loss $\ell_{1:T}$
1540: of FPL. The {\em actual} loss at time $t$ is
1541: $u_t=M(s_{<t}+{k-q\over\eta_t})\scp s_t$. A simple Markov inequality shows
1542: that the total actual loss $u_{1:T}$ exceeds
1543: the total expected loss $\ell_{1:T}=E[u_{1:T}]$ by a factor of
1544: $c>1$ with probability at most $1/c$:
1545: \beqn
1546:   P[u_{1:T}\geq c\!\cdot\!\ell_{1:T}]
1547:   \;\leq\; {1/c}.
1548: \eeqn
1549: Randomizing independently for each $t$ as described in the
previous section, the actual loss is
1551: $u_t=M(s_{<t}+{k-q_t\over\eta_t})\scp s_t$ with the same expected loss
1552: $\ell_{1:T}=E[u_{1:T}]$ as before. The advantage of independent
1553: randomization is that we can get a much better
1554: high-probability bound. We can exploit a Chernoff-Hoeffding
1555: bound \citep[Cor.5.2b]{McDiarmid:89}, valid for arbitrary
1556: independent random variables $0\leq u_t\leq 1$ for
1557: $t=1,...,T$:
1558: \beqn
1559:   P\Big[|u_{1:T}-E[u_{1:T}]|\geq\delta E[u_{1:T}]\Big]
1560:   \;\leq\; 2\exp(-{\textstyle{1\over 3}}\delta^2 E[u_{1:T}]), \qquad 0\leq\delta\leq 1.
1561: \eeqn
1562: For $\delta=\sqrt{3c/\ell_{1:T}}$ we get
1563: \beq\label{eqCH}
1564:   P[|u_{1:T}-\ell_{1:T}|\geq\sqrt{3c\ell_{1:T}}]
1565:   \;\leq\; 2\e^{-c}
1566:   \qmbox{as soon as}
1567:   \ell_{1:T}\geq 3c.
1568: \eeq
1569: Using (\ref{eqCH}), the bounds for $\ell_{1:T}$ of Theorems
1570: \ref{thFPLStatic}--\ref{thFPL2} can be rewritten to yield
1571: similar bounds with high probability ($1-2\e^{-c}$) for $u_{1:T}$
1572: with small extra regret $\propto\sqrt{c\cdot L}$ or $\propto\sqrt{c\cdot
1573: s_{1:T}^i}$.
1574: %
1575: Furthermore, (\ref{eqCH}) shows that with probability 1,
1576: $u_{1:T}/\ell_{1:T}$ converges rapidly to 1 for
1577: $\ell_{1:T}\to\infty$. Hence we may use the easier to compute
1578: $\eta_t=\sqrt{K/2u_{<t}}$ instead of
1579: $\eta_t=\sqrt{K/2(\ell_{<t}+1)}$, likely with similar bounds on the
1580: regret.
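
The arithmetic behind the Markov bound and (\ref{eqCH}) can be
spelled out as follows. This is only a sketch of the two formulas
above; the function names \texttt{markov\_tail} and
\texttt{chernoff\_band} are hypothetical.

```python
import math


def markov_tail(factor):
    """Markov: P[u_{1:T} >= factor * l_{1:T}] <= 1 / factor, for factor > 1."""
    if factor <= 1.0:
        raise ValueError("factor must exceed 1")
    return 1.0 / factor


def chernoff_band(expected_loss, c):
    """Return (width, failure probability) of the bound (eqCH).

    With delta = sqrt(3c / l), the deviation |u - l| exceeds
    width = sqrt(3 c l) with probability at most 2 exp(-c).
    The bound requires delta <= 1, i.e. l >= 3c.
    """
    if expected_loss < 3.0 * c:
        raise ValueError("bound requires expected_loss >= 3c")
    width = math.sqrt(3.0 * c * expected_loss)
    return width, 2.0 * math.exp(-c)


if __name__ == "__main__":
    print(markov_tail(2.0))
    print(chernoff_band(300.0, 3.0))
```

The width grows only like the square root of the expected loss,
whereas the Markov bound controls deviations only up to a constant
factor of it.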
1581: 
1582: %-------------------------------%
1583: \paradot{Computational Aspects}
1584: %-------------------------------%
1585: It is easy to generate the randomized decision of FPL. Indeed,
1586: only a single initial exponentially distributed vector
1587: $q\in\SetR^n$ is needed. Only for self-confident $\eta_t\propto
1/\sqrt{\ell_{<t}}$ (see Theorem~\ref{thFPLLDynamic}) do we need to
1589: compute expectations explicitly. Given $\eta_t$, from $t\leadsto
1590: t+1$ we need to compute $\ell_t$ in order to update $\eta_t$. Note
1591: that $\ell_t=w_t\!\scp s_t$, where $w_t^i=P[I_t=i]$ and
1592: $I_t:=\arg\min_{i\in\E}\{s_{<t}^i+{k^i-q^i\over\eta_t}\}$ is the
1593: actual (randomized) prediction of FPL. With $s:=s_{<t}+k/\eta_t$,
1594: $P[I_t=i]$ has the following representation:
1595: \bqan
1596:   P[I_t=i]
  &=& P[s^i-{q^i\over\eta_t}\leq s^j-{q^j\over\eta_t} \;\forall j\neq i] \\
  &=& \int P[s^i-{q^i\over\eta_t}=m \;\wedge\; s^j-{q^j\over\eta_t}\geq m \;\forall j\neq i]dm \\
1599:   &=& \int P[q^i=\eta_t(s^i-m)]\cdot\prod_{j\neq i}P[q^j\leq \eta_t(s^j-m)]dm \\[-1ex]
1600:   &=& \int_{-\infty}^{s^{min}} \eta_t \e^{-\eta_t(s^i-m)}
1601:       \prod_{j\neq i}(1-\e^{-\eta_t(s^j-m)})dm \\
1602:   &=& \sum_{{\cal M}:\{i\}\subseteq{\cal M}\subseteq{\cal N}}\!\!
1603:   {\textstyle{(-)^{|{\cal M}|-1}\over|{\cal M}|}}\e^{-\eta_t\sum_{j\in\cal M}(s^j-s^{min})}
1604: \eqan
1605: In the last equality we expanded the product and performed the
1606: resulting exponential integrals. For finite $n$, the
1607: second to last one-dimensional integral should be numerically
1608: feasible. Once the product $\prod_{j=1}^n(1-\e^{-\eta_t(s^j-m)})$
1609: has been computed in time $O(n)$, the argument of the integral can
1610: be computed for each
1611: $i$ in time $O(1)$, hence the overall time to compute $\ell_t$ is
1612: $O(c\cdot n)$, where $c$ is the time to numerically compute one
1613: integral. For infinite $n$, the last sum may be approximated
1614: by the dominant contributions. Alternatively, one can modify
1615: the algorithm by considering only a finite pool of experts in
1616: each time step; see next paragraph. The expectation may also
1617: be approximated by (Monte Carlo) sampling $I_t$ several times.
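
The three ways of obtaining $P[I_t=i]$ mentioned above can be
compared numerically. The following is a minimal sketch under our own
assumptions: the input \texttt{s} stands for $s_{<t}+k/\eta_t$, the
one-dimensional integral is evaluated by a midpoint rule after the
substitution $v=\e^{-\eta_t(s^{min}-m)}$ (a convenience introduced
here, which turns the integrand into the polynomial
$a_i\prod_{j\neq i}(1-a_j v)$ on $[0,1]$ with
$a_j=\e^{-\eta_t(s^j-s^{min})}$), and all function names are
hypothetical.

```python
import itertools
import math
import random


def choice_probs_exact(s, eta):
    """Inclusion-exclusion sum; exponential in n, so only for tiny n."""
    n = len(s)
    s_min = min(s)
    probs = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(n):
            for extra in itertools.combinations(others, r):
                subset = (i,) + extra
                expo = -eta * sum(s[j] - s_min for j in subset)
                sign = (-1) ** (len(subset) - 1)
                total += sign / len(subset) * math.exp(expo)
        probs.append(total)
    return probs


def choice_probs_quadrature(s, eta, nodes=2000):
    """Midpoint rule on [0, 1] after substituting v = exp(-eta * (s_min - m))."""
    s_min = min(s)
    a = [math.exp(-eta * (sj - s_min)) for sj in s]
    probs = [0.0] * len(s)
    for k in range(nodes):
        v = (k + 0.5) / nodes  # midpoints never hit v = 1, so no division by 0
        full = math.prod(1.0 - aj * v for aj in a)  # O(n), once per node
        for i, ai in enumerate(a):
            probs[i] += ai * full / (1.0 - ai * v) / nodes  # O(1) per expert
    return probs


def choice_probs_monte_carlo(s, eta, samples=20000, seed=0):
    """Estimate P[I_t = i] by sampling exponential perturbations."""
    rng = random.Random(seed)
    counts = [0] * len(s)
    for _ in range(samples):
        perturbed = [sj - rng.expovariate(1.0) / eta for sj in s]
        counts[perturbed.index(min(perturbed))] += 1
    return [c / samples for c in counts]


if __name__ == "__main__":
    s, eta = [0.0, 0.5, 1.2], 1.3
    print(choice_probs_exact(s, eta))
    print(choice_probs_quadrature(s, eta))
    print(choice_probs_monte_carlo(s, eta))
```

Given the resulting weights $w_t$, the expected instantaneous loss is
$\ell_t=w_t\!\scp s_t$. The quadrature variant mirrors the remark
above that the full product is computed once per integration node and
then reused for every $i$.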
1618: 
1619: Recall that approximating $\ell\ltt$ can be avoided by
1620: using $s\ltt\smin$ (Theorem~\ref{thFPL2}) or $u\ltt$ (bounds with
1621: high probability) instead.
1622: 
1623: %-------------------------------%
1624: \paradot{Finitized expert pool}
1625: %-------------------------------%
1626: In the case of an infinite expert class, FPL has to compute a
1627: minimum over an infinite set in each time step, which is not
1628: directly feasible. One possibility to address this is to
1629: choose the experts from a \emph{finite pool} in each time
1630: step. This is the case in the algorithm of \citet{Gentile:03},
1631: and also discussed by \citet{Littlestone:94}. For FPL, we can
1632: obtain this behavior by introducing an \emph{entering time}
1633: $\tau^i\geq 1$ for each expert. Then expert $i$ is not
considered for $t<\tau^i$. In the bounds, this leads to an
1635: additional $\frac{1}{\eta_T}$ in Theorem \ref{thIFPL} and
1636: Corollary \ref{corIFPL} and a further additional $\tau^i$ in
1637: the final bounds (Theorems \ref{thFPLStatic}--\ref{thFPL2}),
since we must add the regret of the best expert in hindsight
that has already entered the game relative to the best expert in
hindsight overall. Selecting
1641: $\tau^i=k^i$ implies bounds for FPL with entering times similar to
1642: the ones we derived here. The details and proofs for this
1643: construction can be found in \citep{Poland:05actexp}.
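
A single FPL decision with entering times might be sketched as
follows. This is an illustration under our own assumptions, not the
construction of \citep{Poland:05actexp}: perturbations are drawn
afresh at each call (independent randomization), only experts with
$\tau^i\leq t$ are compared, and the name \texttt{fpl\_choice} and
its parameters are hypothetical.

```python
import random


def fpl_choice(cum_loss, complexity, eta, t, entering_time, rng):
    """Return argmin_i { s^i + (k^i - q^i) / eta } over experts with tau^i <= t."""
    best_index, best_value = None, float("inf")
    for i, (s, k, tau) in enumerate(zip(cum_loss, complexity, entering_time)):
        if t < tau:
            continue  # expert i has not entered the pool yet
        q = rng.expovariate(1.0)  # fresh exponential perturbation for expert i
        value = s + (k - q) / eta
        if value < best_value:
            best_index, best_value = i, value
    if best_index is None:
        raise ValueError("no expert has entered at time t")
    return best_index


if __name__ == "__main__":
    rng = random.Random(0)
    # At t = 1 only the first expert is available, whatever its loss.
    print(fpl_choice([5.0, 0.0, 0.0], [1.0, 2.0, 3.0], 0.5, 1, [1, 2, 3], rng))
```

Because at most finitely many experts have entered at any time $t$,
the minimum is taken over a finite set even if the expert class is
infinite.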
1644: 
1645: %-------------------------------%
1646: \paradot{Deterministic prediction and absolute loss}
1647: %-------------------------------%
Another use of $w_t$ from the second to last paragraph is the following: If
1649: the decision space is $\D=\Delta$, then FPL may make a
1650: deterministic decision $d=w_t\in\Delta$ at time $t$ with bounds
1651: now holding for sure, instead of selecting $e_i$ with probability
$w_t^i$. For example, for the absolute loss $s_t^i=|x_t-y_t^i|$
1653: with observation $x_t\in[0,1]$ and predictions $y_t^i\in[0,1]$, a
1654: master algorithm predicting deterministically $w_t\!\scp
1655: y_t\in[0,1]$ suffers absolute loss $|x_t-w_t\!\scp y_t|\leq\sum_i
1656: w_t^i|x_t-y_t^i|=\ell_t$, and hence has the same (or better)
1657: performance guarantees as FPL. In general, masters can be chosen
1658: deterministic if prediction space $\Y$ and loss-function Loss$(x,y)$ are
1659: convex.
1660: %
1661: For $x_t,y_t^i\in\{0,1\}$, the absolute loss $|x_t-p_t|$ of a master
1662: deterministically predicting $p_t\in[0,1]$ actually coincides with
1663: the $p_t$-expected 0/1 loss of a master predicting 1 with
probability $p_t$. Hence a regret bound for the absolute loss also
implies the same regret bound for the 0/1 loss.
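
As a small numerical illustration (the numbers are hypothetical and
chosen by us), take $n=3$, $w_t=(0.5,0.3,0.2)$, $y_t=(0,1,0.5)$ and
$x_t=0.4$. Then $w_t\!\scp y_t=0.4$ and
\beqn
  |x_t-w_t\!\scp y_t| \;=\; 0
  \;\leq\;
  \sum_i w_t^i|x_t-y_t^i|
  \;=\; 0.5\cdot 0.4+0.3\cdot 0.6+0.2\cdot 0.1 \;=\; 0.4 \;=\; \ell_t .
\eeqn
Likewise, for binary $x_t=1$ and $p_t=0.4$, the absolute loss
$|x_t-p_t|=0.6$ equals the probability $1-p_t$ of predicting $0$,
i.e.\ the expected 0/1 loss.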
1666: 
1667: 
1668: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1669: \section{Discussion and Open Problems}\label{secConc}
1670: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1671: 
1672: How does FPL compare with other expert advice algorithms? We
1673: briefly discuss four issues, summarized in Table \ref{tabregconst}.
1674: 
1675: %-------------------------------%
1676: \paradot{Static bounds}
1677: %-------------------------------%
1678: Here the coefficient of the regret term $\sqrt{KL}$, referred to
1679: as the \emph{leading constant} in the sequel, is $2$ for FPL
1680: (Theorem~\ref{thFPLStatic}). It is thus a factor of $\sqrt 2$
1681: worse than the Hedge bound for arbitrary loss by
1682: \citet{Freund:97}, which is sharp in some sense \citep{Vovk:95}.
1683: This is the price one pays for FPL and its easy analysis for
1684: adaptive learning rate. There is evidence that this (worst-case)
1685: difference really exists and is not only a proof artifact.
1686: %-------------------------------%
1687: %\paradot{Special losses}
1688: %-------------------------------%
1689: For special loss functions, the bounds can sometimes be improved,
1690: e.g.\ to a leading constant of 1 in the static (randomized) WM
1691: case with 0/1 loss \citep{Cesa:97}\footnote{While FPL and Hedge and WMR
1692: \citep{Littlestone:94} can sample an expert without knowing its
1693: prediction, \citet{Cesa:97} need to know the experts' predictions.
1694: Note also that for many (smooth) loss-functions like the quadratic
1695: loss, finite regret can be achieved \citep{Vovk:90}.}.
1696: Because of the structure of the FPL algorithm however, it is
1697: questionable if corresponding bounds hold there.
1698: 
1699: %-------------------------------%
1700: \paradot{Dynamic bounds}
1701: %-------------------------------%
1702: Not knowing the right learning rate in advance usually costs a
1703: factor of $\sqrt 2$. This is true for Hannan's algorithm
1704: \citep{Kalai:03} as well as in all our cases. Also for binary
1705: prediction with uniform complexities and 0/1 loss, this result
1706: has been established recently -- \citet{Yaroshinsky:04} show a
1707: dynamic regret bound with leading constant $\sqrt 2(1+\eps)$.
1708: Remarkably, the best dynamic bound for a WM variant proven by
1709: \citet{Auer:02pea} has a leading constant $2\sqrt 2$, which
1710: matches ours. Considering the difference in the static case,
1711: we therefore conjecture that a bound with leading constant of
1712: $2$ holds for a dynamic Hedge algorithm.
1713: 
1714: %-------------------------------%
1715: \paradot{General weights}
1716: %-------------------------------%
1717: While there are several dynamic bounds for uniform weights, the
1718: only previous result for non-uniform weights we know of is
1719: \citep[Cor.16]{Gentile:03}, which gives the dynamic bound
1720: $\ell^{\mbox{\scriptsize Gentile}}\leqT\leq s^i\leqT+i+
1721: O\Big[\sqrt{(s^i\leqT+i)\ln(s^i\leqT+i)}\Big]$ for a $p$-norm
1722: algorithm for the absolute loss. This is comparable to our bound
1723: for rapidly decaying weights $w^i=\exp(-i)$, i.e.\ $k^i=i$. Our
1724: hierarchical FPL bound in Theorem \ref{thHFPL} $(b)$ generalizes
1725: this to arbitrary weights and losses and strengthens it, since
both the asymptotic order and the leading constant are smaller.
1727: 
1728: It seems that the analysis of all experts algorithms, including
1729: Weighted Majority variants and FPL, gets more complicated for
1730: general weights together with adaptive learning rate, because the
1731: choice of the learning rate must account for both the weight of
1732: the best expert (in hindsight) and its loss. Both quantities are
1733: not known in advance, but may have a different impact on the
1734: learning rate: While increasing the current loss estimate always
1735: decreases $\eta_t$, the optimal learning rate for an expert with
1736: higher complexity would be larger. On the other hand, all analyses
1737: known so far require decreasing $\eta_t$. Nevertheless we
1738: conjecture that the bounds $\propto\sqrt{Tk^i}$ and
1739: $\propto\sqrt{s_{1:T}^i k^i}$ also hold without the hierarchy
1740: trick, probably by using expert dependent learning rate
1741: $\eta_t^i$.
1742: 
1743: \begin{table}[t]\centering\small
1744: \begin{tabular}{|c|c|c|c|c|}
1745:   \hline
1746:   $\eta$ & Loss & conjecture & Low.Bnd. & Upper Bound \\ \hline
1747:   static & 0/1 & 1            & 1?                         & 1 \citep{Cesa:97} \\
1748:   static & any & $\sqrt{2}$ ! & $\sqrt{2}$ \citep{Vovk:95} & $\sqrt{2}$ \cite[Hedge]{Freund:97}, 2 [FPL] \\
1749:  dynamic & 0/1 & $\sqrt{2}$   & 1? \citep{Hutter:03optisp} & $\sqrt{2}$ \cite{Yaroshinsky:04}, $2\sqrt{2}$ \cite[WM-Type?]{Auer:02pea} \\
1750:  dynamic & any & 2            & $\sqrt{2}$ \citep{Vovk:95} & $2\sqrt{2}$ [FPL], 2 \cite[Bayes]{Hutter:03optisp} \\
1751:   \hline
1752: \end{tabular}
1753:   \caption{\label{tabregconst}Comparison of the constants $c$ in regrets
1754:   $c\sqrt{\mbox{Loss}\times\ln n}$ for various settings and algorithms.}
1755: \end{table}
1756: 
1757: %-------------------------------%
1758: \paradot{Comparison to Bayesian sequence prediction}
1759: %-------------------------------%
1760: We can also compare the \emph{worst-case} bounds for FPL obtained
1761: in this work to similar bounds for \emph{Bayesian sequence
1762: prediction}. Let $\{\nu_i\}$ be a class of probability
1763: distributions over sequences and assume that the true sequence is
1764: sampled from $\mu\in\{\nu_i\}$ with complexity $k^\mu$ ($\sum_i
1765: \e^{-k^{\nu_i}}\leq 1$). Then it is known that the Bayes optimal
1766: predictor based on the $\e^{-k^{\nu_i}}$-weighted mixture of
1767: $\nu_i$'s has an expected total loss of at most
1768: $L^\mu+2\sqrt{L^\mu k^\mu}+2k^\mu$, where $L^\mu$ is the expected
1769: total loss of the Bayes optimal predictor based on $\mu$
1770: \citep[Thm.2]{Hutter:02spupper},
1771: \citep[Thm.3.48]{Hutter:04uaibook}. Using FPL, we obtained
1772: the same bound except for the leading order constant, but for
1773: any sequence independently of the assumption that it is
1774: generated by $\mu$. This is another indication that a PEA
1775: bound with leading constant 2 could hold. See
1776: \citet{Hutter:04bayespea},
1777: \citet[Sec.6.3]{Hutter:03optisp} and
1778: \citet[Sec.3.7.4]{Hutter:04uaibook} for a more detailed
1779: comparison of Bayes bounds with PEA bounds.
1780: 
1781: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1782: %         Bibliography        %
1783: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1784: 
1785: \begin{small}
1786: \begin{thebibliography}{ACBFS95}
1787: 
1788: \bibitem[ACBFS95]{Auer:95}
1789: P.~Auer, N.~Cesa-Bianchi, Y.~Freund, and R.~E. Schapire.
1790: \newblock Gambling in a rigged casino: The adversarial multi-armed bandit
1791:   problem.
1792: \newblock In {\em Proc. 36th Annual Symposium on Foundations of Computer
1793:   Science (FOCS 1995)}, pages 322--331, Los Alamitos, CA, 1995. IEEE Computer
1794:   Society Press.
1795: 
1796: \bibitem[ACBG02]{Auer:02pea}
1797: P.~Auer, N.~Cesa-Bianchi, and C.~Gentile.
1798: \newblock Adaptive and self-confident on-line learning algorithms.
1799: \newblock {\em Journal of Computer and System Sciences}, 64(1):48--75, 2002.
1800: 
1801: \bibitem[AG00]{Auer:00}
1802: P.~Auer and C.~Gentile.
1803: \newblock Adaptive and self-confident on-line learning algorithms.
1804: \newblock In {\em Proc. 13th Conf. on Computational Learning Theory}, pages
1805:   107--117. Morgan Kaufmann, San Francisco, CA, 2000.
1806: 
1807: \bibitem[CB97]{Cesa:97}
1808: N.~Cesa-Bianchi{ et al.}
1809: \newblock How to use expert advice.
1810: \newblock {\em Journal of the ACM}, 44(3):427--485, 1997.
1811: 
1812: \bibitem[FS97]{Freund:97}
1813: Y.~Freund and R.~E. Schapire.
1814: \newblock A decision-theoretic generalization of on-line learning and an
1815:   application to boosting.
1816: \newblock {\em Journal of Computer and System Sciences}, 55(1):119--139, 1997.
1817: 
1818: \bibitem[Gen03]{Gentile:03}
1819: C.~Gentile.
1820: \newblock The robustness of the p-norm algorithm.
1821: \newblock {\em Machine Learning}, 53(3):265--299, 2003.
1822: 
1823: \bibitem[Han57]{Hannan:57}
1824: J.~Hannan.
1825: \newblock Approximation to {Bayes} risk in repeated plays.
1826: \newblock In {\em Contributions to the Theory of Games 3}, pages 97--139.
1827:   Princeton University Press, 1957.
1828: 
1829: \bibitem[HP04]{Hutter:04expert}
1830: M.~Hutter and J.~Poland.
1831: \newblock Prediction with expert advice by following the perturbed leader for
1832:   general weights.
1833: \newblock In {\em Proc. 15th International Conf. on Algorithmic Learning Theory
1834:   ({ALT-2004})}, volume 3244 of {\em LNAI}, pages 279--293, Padova, 2004.
1835:   Springer, Berlin.
1836: 
1837: \bibitem[Hut03a]{Hutter:02spupper}
1838: M.~Hutter.
1839: \newblock Convergence and loss bounds for {Bayesian} sequence prediction.
1840: \newblock {\em IEEE Transactions on Information Theory}, 49(8):2061--2067,
1841:   2003.
1842: 
1843: \bibitem[Hut03b]{Hutter:03optisp}
1844: M.~Hutter.
1845: \newblock Optimality of universal {B}ayesian prediction for general loss and
1846:   alphabet.
1847: \newblock {\em Journal of Machine Learning Research}, 4:971--1000, 2003.
1848: 
1849: \bibitem[Hut04a]{Hutter:04bayespea}
1850: M.~Hutter.
1851: \newblock Online prediction -- {B}ayes versus experts.
1852: \newblock Technical report,
1853:   http://www.idsia.ch/$_{^\sim}$marcus/ai/bayespea.htm, July 2004.
1854: \newblock Presented at the EU PASCAL Workshop on Learning Theoretic and
1855:   Bayesian Inductive Principles (LTBIP-2004).
1856: 
1857: \bibitem[Hut04b]{Hutter:04uaibook}
1858: M.~Hutter.
1859: \newblock {\em Universal Artificial Intelligence: Sequential Decisions based on
1860:   Algorithmic Probability}.
1861: \newblock Springer, Berlin, 2004.
1862: \newblock 300 pages, http://www.idsia.ch/$_{^{\sim}}$marcus/ai/uaibook.htm.
1863: 
1864: \bibitem[KV03]{Kalai:03}
1865: A.~Kalai and S.~Vempala.
1866: \newblock Efficient algorithms for online decision.
1867: \newblock In {\em Proc. 16th Annual Conf. on Learning Theory ({COLT-2003})},
1868:   Lecture Notes in Artificial Intelligence, pages 506--521, Berlin, 2003.
1869:   Springer.
1870: 
1871: \bibitem[LW89]{Littlestone:89}
1872: N.~Littlestone and M.~K. Warmuth.
1873: \newblock The weighted majority algorithm.
1874: \newblock In {\em 30th Annual Symposium on Foundations of Computer Science},
1875:   pages 256--261, Research Triangle Park, NC, 1989. IEEE.
1876: 
1877: \bibitem[LW94]{Littlestone:94}
1878: N.~Littlestone and M.~K. Warmuth.
1879: \newblock The weighted majority algorithm.
1880: \newblock {\em Information and Computation}, 108(2):212--261, 1994.
1881: 
1882: \bibitem[MB04]{McMahan:04}
1883: H.~B. McMahan and A.~Blum.
1884: \newblock Online geometric optimization in the bandit setting against an
1885:   adaptive adversary.
1886: \newblock In {\em 17th Annual Conference on Learning Theory (COLT)}, volume
1887:   3120 of {\em LNCS}, pages 109--123. Springer, 2004.
1888: 
1889: \bibitem[McD89]{McDiarmid:89}
1890: C.~McDiarmid.
1891: \newblock On the method of bounded differences.
1892: \newblock {\em Surveys in Combinatorics}, 141, London Mathematical Society
1893:   Lecture Notes Series:148--188, 1989.
1894: 
1895: \bibitem[PH05]{Poland:05actexp}
1896: J.~Poland and M.~Hutter.
1897: \newblock Master algorithms for active experts problems based on increasing
1898:   loss values.
1899: \newblock In {\em Annual Machine Learning Conference of Belgium and the
1900:   Netherlands ({Benelearn-2005})}, Enschede, 2005.
1901: 
1902: \bibitem[Vov90]{Vovk:90}
1903: V.~G. Vovk.
1904: \newblock Aggregating strategies.
1905: \newblock In {\em Proc. 3rd Annual Workshop on Computational Learning Theory},
1906:   pages 371--383, Rochester, New York, 1990. ACM Press.
1907: 
1908: \bibitem[Vov95]{Vovk:95}
1909: V.~G. Vovk.
1910: \newblock A game of prediction with expert advice.
1911: \newblock In {\em Proc. 8th Annual Conf. on Computational Learning Theory},
1912:   pages 51--60. ACM Press, New York, NY, 1995.
1913: 
1914: \bibitem[YEYS04]{Yaroshinsky:04}
1915: R.~Yaroshinsky, R.~El-Yaniv, and S.~Seiden.
1916: \newblock How to better use expert advice.
1917: \newblock {\em Machine Learning}, 55(3):271--309, 2004.
1918: 
1919: \end{thebibliography}
1920: \end{small}
1921: 
1922: \end{document}
1923: 
1924: %--------------------End-of-Expertx.tex-----------------------%
1925: