0504:cs0504078/cs0504078

1:

2: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3: % Adaptive Online Prediction by Following the Perturbed Leader%

4: %%      Marcus Hutter & Jan Poland: Start: December 2003     %%

5: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

6:

7: \documentclass[12pt,twoside]{article}

8: \usepackage{latexsym}

9: \topmargin=-1cm  \oddsidemargin=5mm \evensidemargin=5mm

10: \textwidth=15cm \textheight=22cm \unitlength=1mm

11: \sloppy\lineskip=0pt

12:

13: %-------------------------------%

14: %   Macro-Definitions           %

15: %-------------------------------%

16: \def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}

17: \def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip=  4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}

18: \def\textmuskip{\thinmuskip= 0mu                    \medmuskip=  1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}

19: %\def\dispmuskip{}\def\textmuskip{}    %normal math-spacing

20: \textmuskip

21: \def\beq{\dispmuskip\begin{equation}}    \def\eeq{\end{equation}\textmuskip}

22: \def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}

23: \def\bqa{\dispmuskip\begin{eqnarray}}    \def\eqa{\end{eqnarray}\textmuskip}

24: \def\bqan{\dispmuskip\begin{eqnarray*}}  \def\eqan{\end{eqnarray*}\textmuskip}

25: \newtheorem{theorem}{Theorem}

26: \newtheorem{corollary}[theorem]{Corollary}

27: \newtheorem{lemma}[theorem]{Lemma}

28: \newtheorem{definition}[theorem]{Definition}

29: \newenvironment{keywords}{\centerline{\bf\small

30: Keywords}\begin{quote}\small}{\par\end{quote}\vskip 1ex}

31: \def\citet{\cite}\def\citep{\cite}\def\citealt{\cite}\def\citeauthor{\cite}

32: \def\myparskip{\vspace{1.5ex plus 0.5ex minus 0.5ex}\noindent}

33: \def\paragraph#1{\myparskip{\bfseries\boldmath{#1.}}}

34: \def\paradot#1{\myparskip{\bfseries\boldmath{#1.}}}

35: \def\paranodot#1{\myparskip{\bfseries\boldmath{#1}}}

36: \def\eps{\varepsilon}

37: \def\nq{\hspace{-1em}}

38: \def\qed{\hspace*{\fill}$\Box\quad$\\}

39: \def\odt{{\textstyle{1\over 2}}}

40: \def\v{\boldsymbol}

41: \def\p{{\scriptscriptstyle+}}

42: \def\n{{n}}

43: \def\t{\pi}

44: \def\pin{{\scriptstyle\Pi}}

45: \def\Var{{\mbox{Var}}}

46: \def\Cov{{\mbox{Cov}}}

47: \def\SetR{I\!\!R}

48: \def\SetN{I\!\!N}

49: \def\N{{\cal N}}

50: \def\D{{\cal D}}

51: \def\S{{\cal S}}

52: \def\E{{\cal E}}

53: \def\X{{\cal X}}                        % input/perception set/alphabet

54: \def\Y{{\cal Y}}                        % output/action set/alphabet

55: \def\qmbox#1{{\quad\mbox{#1}\quad}}

56: \def\scp{{\scriptscriptstyle^{\,\circ}}}

57: \def\sooe{{\textstyle{1\over\eta}}}

58: \def\FPL{\text{FPL} }

59: \def\IFPL{\text{IFPL} }

60:

61: \def\leqt{_{1:t}}

62: \def\leqtj{_{1:{t_j}}}

63: \def\leqtjj{_{1:{t_{j-1}}}}

64: \def\leqT{_{1:T}}

65: \def\ltT{_{<T}}

66: \def\ltt{_{<t}}

67: \def\lttj{_{<{t_j}}}

68: \def\lttjj{_{<{t_{j-1}}}}

69: \def\leqn{_{1:n}}

70: \def\leqss{_{1:s}}

71: \def\smin{^{min}}

72:

73: \def\leqs#1{\stackrel {#1} \leq}

74: \def\text#1{\mbox{\scriptsize{#1}}}

75: \def\e{{\rm e}}                        % natural e

76:

77: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

78: %                      T i t l e - P a g e                      %

79: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

80:

81: \begin{document}

82: \title{\vskip -10mm\normalsize\sc Technical Report \hfill IDSIA-10-05

83: \vskip 2mm\bf\Large\hrule height5pt \vskip 6mm

84: Adaptive Online Prediction by \\ Following the Perturbed Leader

85: \vskip 6mm \hrule height2pt \vskip 5mm}

86: \author{{\bf Marcus Hutter} and {\bf Jan Poland}\\[3mm]

87: \normalsize IDSIA, Galleria 2, CH-6928\ Manno-Lugano, Switzerland%

88: \thanks{This work was supported by SNF grant 2100-67712.02.\newline\hspace*{3.6ex}

89: A shorter version appeared in the proceedings of the ALT 2004 conference \citep{Hutter:04expert}.}\\

90: \normalsize \{marcus,jan\}@idsia.ch, \ http://www.idsia.ch/$^{_{_\sim}}\!$\{marcus,jan\} }

91: \date{14 April 2005}

92: \maketitle

93:

94: \begin{abstract}%

95: When applying aggregating strategies to Prediction with Expert

96: Advice, the learning rate must be adaptively tuned. The

97: natural choice of $\sqrt{\mbox{complexity/current loss}}$

98: renders the analysis of Weighted Majority derivatives quite

99: complicated. In particular, for arbitrary weights there have

100: been no results proven so far. The analysis of the alternative

101: ``Follow the Perturbed Leader'' (FPL) algorithm from Kalai and

102: Vempala (2003) based on Hannan's algorithm is easier. We

103: derive loss bounds for adaptive learning rate and both finite

104: expert classes with uniform weights and countable expert

105: classes with arbitrary weights. For the former setup, our loss

106: bounds match the best known results so far, while for the

107: latter our results are new.

108: \end{abstract}

109:

110: \begin{keywords}

111: Prediction with Expert Advice,

112: Follow the Perturbed Leader,

113: general weights,

114: adaptive learning rate,

115: adaptive adversary,

116: hierarchy of experts,

117: expected and high probability bounds,

118: general alphabet and loss,

119: online sequential prediction.

120: \end{keywords}

121:

122: \newpage

123: %------------------------------%

124: %      Table of Contents       %

125: %------------------------------%

126: \begin{quote}\begin{quote}

127: \def\contentsname{\normalsize \hfil Contents \hfil}

128: {\parskip=-2.5ex\tableofcontents}

129: \end{quote}\end{quote}

130:

131: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

132: \section{Introduction}\label{secInt}

133: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

134:

135: %-------------------------------%

136: %\paradot{Prediction with Expert Advice}

137: %-------------------------------%

138: In Prediction with Expert Advice (PEA) one considers an

139: ensemble of sequential predictors (experts). A master

140: algorithm is constructed based on the historical performance

141: of the predictors. The goal of the master algorithm is to

142: perform nearly as well as the best expert in the class, on any

143: sequence of outcomes. This is achieved by making (randomized)

144: predictions close to the better experts.

145:

146: %-------------------------------%

147: %\paradot{Historical Survey}

148: %-------------------------------%

149: PEA theory has rapidly developed in the recent past.

150: Starting with the Weighted Majority (WM) algorithm of

151: \citet{Littlestone:89,Littlestone:94} and the aggregating

152: strategy of \citet{Vovk:90}, a vast variety of different

153: algorithms and variants have been published. A key parameter

154: in all these algorithms is the \emph{learning rate}. While

155: this parameter had to be fixed in the early algorithms such as

156: WM, \citet{Cesa:97} established the so-called doubling trick

157: to make the learning rate coarsely adaptive. A little later,

158: incrementally adaptive algorithms were developed by

159: \citet{Auer:00,Auer:02pea,Yaroshinsky:04,Gentile:03}, and

160: others. In Section \ref{secConc}, we will compare our results

161: with these works in more detail. Unfortunately, the loss bound

162: proofs for the incrementally adaptive WM variants are quite

163: complex and technical, despite the typically simple and

164: elegant proofs for a static learning rate.

165:

166: %-------------------------------%

167: %\paradot{Adaptive Learning Rate}

168: %-------------------------------%

169: The growing complexity of the proof techniques also had another consequence:

170: While for the original WM algorithm, assertions are proven for

171: countable classes of experts with arbitrary weights, the modern

172: variants usually restrict to finite classes with uniform weights

173: (an exception being \citet{Gentile:03}, see the discussion

174: section). This might be sufficient for many practical purposes but

175: it prevents the application to more general classes of predictors.

176: Examples are extrapolating (=predicting) data points with the help

177: of a polynomial (=expert) of degree $d=1,2,3,...$ --or-- the (from

178: a computational point of view largest) class of all computable

179: predictors. Furthermore, most authors have concentrated on

180: predicting \emph{binary} sequences, often with the 0/1 loss for

181: $\{0,1\}$-valued and the absolute loss for $[0,1]$-valued

182: predictions. Arbitrary losses are less common. Nevertheless, it is

183: easy to abstract completely from the predictions and consider the

184: resulting losses only. Instead of predicting according to a

185: ``weighted majority'' in each time step, one chooses one

186: \emph{single} expert with a probability depending on his past

187: cumulated loss. This is done e.g.\ by \citet{Freund:97}, where an

188: elegant WM variant, the Hedge algorithm, is analyzed.

189:

190: %-------------------------------%

191: %\paradot{Follow the Perturbed Leader}

192: %-------------------------------%

193: A different, general approach to achieve similar results is

194: ``Follow the Perturbed Leader'' (FPL). The principle dates

195: back to as early as 1957, now called Hannan's algorithm

196: \citep{Hannan:57}. In 2003, Kalai and Vempala published a

197: simpler proof of the main result of Hannan and also succeeded

198: in improving the bound by modifying the distribution of the

199: perturbation\nocite{Kalai:03}. The resulting algorithm (which

200: they call FPL*) has the same performance guarantees as the

201: WM-type algorithms for fixed learning rate, save for a factor

202: of $\sqrt 2$. A major advantage we will discover in this work

203: is that its analysis remains easy for an adaptive learning

204: rate, in contrast to the WM derivatives. Moreover, it

205: generalizes to online decision problems other than PEA.

206:

207: %-------------------------------%

208: %\paradot{What' new}

209: %-------------------------------%

210: In this work, we study the FPL algorithm for PEA. The problems of

211: WM algorithms mentioned above are addressed: Bounds on the

212: cumulative regret of the standard form $\sqrt{kL}$ (where $k$ is

213: the complexity and $L$ is the cumulative loss of the best expert

214: in hindsight) are shown for countable expert classes with

215: arbitrary weights, adaptive learning rate, and arbitrary losses.

216: Regarding the adaptive learning rate, we obtain proofs that are

217: simpler and more elegant than for the corresponding WM algorithms.

218: (In particular, the proof for a self-confident choice of the

219: learning rate, Theorem~\ref{thFPLLDynamic}, is less than half a

220: page.) Further, we prove the first loss bounds for \emph{arbitrary

221: weights} and adaptive learning rate. In order to obtain the

222: optimal $\sqrt{kL}$ bound in this case, we will need to introduce

223: a hierarchical version of FPL, while without hierarchy we show a

224: worse bound $k\sqrt{L}$. (For self-confident learning rate

225: together with uniform weights and arbitrary losses, one can prove

226: corresponding results for a variant of WM by adapting an argument

227: by \citealt{Auer:02pea}.)

228:

229: %-------------------------------%

230: %\paradot{Online, worst case and probabilities}

231: %-------------------------------%

232: PEA usually refers to an \emph{online worst case} setting: $n$

233: experts that deliver sequential predictions over a time range

234: $t=1,\ldots,T$ are given. At each time $t$, we know the actual

235: predictions and the \emph{past} losses. The goal is to give a

236: prediction such that the overall loss after $T$ steps is ``not

237: much worse'' than the best expert's loss \emph{on any sequence of

238: outcomes}. If the prediction is deterministic, then an adversary

239: could choose a sequence which provokes maximal loss. So we have to

240: \emph{randomize} our predictions. Consequently, we ask for a

241: prediction strategy such that the \emph{expected} loss on any

242: sequence is small.

243:

244: %-------------------------------%

245: %\paradot{Contents}

246: %-------------------------------%

247: This paper is structured as follows. In Section~\ref{secSetup}

248: we give the basic definitions. While \citeauthor{Kalai:03}

249: consider general online decision problems in

250: finite-dimensional spaces, we focus on online prediction tasks

251: based on a countable number of experts. Like \citet{Kalai:03}

252: we exploit the infeasible FPL predictor (IFPL) in our

253: analysis.

254: %

255: Sections~\ref{secIFPL} and \ref{secFFPL} derive the main

256: analysis tools. In Section~\ref{secIFPL} we generalize (and

257: marginally improve) the upper bound \citep[Lem.3]{Kalai:03} on

258: IFPL to arbitrary weights. The main difficulty we faced was to

259: appropriately distribute the weights to the various terms. For

260: the corresponding lower bound (Section~\ref{secLowFPL}) this

261: is an open problem.

262: %

263: In Section~\ref{secFFPL} we exploit our restricted setup to

264: significantly improve \citep[Eq.(3)]{Kalai:03} allowing for

265: bounds logarithmic rather than linear in the number of

266: experts.

267: %

268: The upper and lower bounds on IFPL are combined to derive

269: various regret bounds on FPL in Section~\ref{secBounds}.

270: Bounds for static and dynamic learning rate in terms of the

271: sequence length follow straightforwardly. The proof of our

272: main bound in terms of the loss is much more elegant than the

273: analysis of previous comparable results.

274: %

275: Section~\ref{secHierarchy} proposes a novel hierarchical procedure

276: to improve the bounds for non-uniform weights.

277: %

278: In Section~\ref{secLowFPL}, a lower bound is established.

279: %

280: In Section~\ref{secAdap}, we consider the case of independent

281: randomization more seriously. In particular, we show that the

282: derived bounds also hold for an adaptive adversary.

283: %

284: Section~\ref{secMisc} treats some additional issues, including

285: bounds with high probability, computational aspects, deterministic

286: predictors, and the absolute loss.

287: %

288: Finally, in Section~\ref{secConc} we discuss our results, compare

289: them to references, and state some open problems.

290:

291: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

292: \section{Setup and Notation}\label{secSetup}

293: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

294:

295: %-------------------------------%

296: \paradot{Setup}

297: %-------------------------------%

298: Prediction with Expert Advice proceeds as follows. We are asked to

299: perform sequential predictions $y_t\in\Y$ at times $t=1,2,\ldots$.

300: At each time step $t$, we have access to the predictions

301: $(y_t^i)_{1\leq i\leq n}$ of $n$ experts $\{e_1,...,e_n\}$, where

302: the size of the expert pool is $n\in\SetN\cup\{\infty\}$. It is

303: convenient to use the same notation for finite ($n\in\SetN$) and

304: countably infinite ($n=\infty$) expert pools. After having made a

305: prediction, we make some observation $x_t\in\X$, and a loss is

306: revealed for our and each expert's prediction. (E.g.\ the loss

307: might be 1 if the expert made an erroneous prediction and 0

308: otherwise. This is the 0/1 loss.) Our goal is to achieve a total

309: loss ``not much worse'' than the best expert, after $t$ time steps.

310:

311: We admit $n\in\SetN\cup\{\infty\}$ experts, each of which is

312: assigned a known complexity $k^i\geq 0$. Usually we require

313: $\sum_i\e^{-k^i}\leq 1$, which implies that the $k^i$ are valid

314: lengths of prefix code words, for instance $k^i=\ln n$ if

315: $n<\infty$ or $k^i=\odt+2\ln i$ if $n=\infty$. Each complexity

316: defines a weight by means of $\smash{\e^{-k^i}}$ and vice versa.

317: In the following we will talk of complexities rather than of

318: weights. If $n$ is finite, then usually one sets $k^i= \ln n$ for

319: all $i$; this is the case of \emph{uniform complexities/weights}.

320: If the set of experts is countably infinite ($n=\infty$), uniform

321: complexities are not possible. The vector of all complexities is

322: denoted by $k=(k^i)_{1\leq i\leq n}$. At each time $t$, each

323: expert $i$ suffers a loss\footnote{The setup, analysis and results

324: easily scale to $s_t^i\in[0,S]$ for $S>0$ other than 1.}

325: $s_t^i=$Loss$(x_t,y_t^i)\in[0,1]$, and $s_t=(s_t^i)_{1\leq i\leq

326: n}$ is the vector of all losses at time $t$. Let

327: $s\ltt=s_1+\ldots+s_{t-1}$ (respectively $s\leqt=s_1+\ldots+s_t$)

328: be the total past loss vector (including current loss $s_t$) and

329: $s\leqt\smin=\min_i\{s\leqt^i\}$ be the loss of the \emph{best

330: expert in hindsight (BEH)}. Usually we do not know in advance the

331: time $t\geq 0$ at which the performance of our predictions is

332: evaluated.
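
As a quick check of the second example (a worked calculation added for illustration, using the standard value $\sum_{i\geq 1} i^{-2}=\pi^2/6$), the complexities $k^i=\odt+2\ln i$ indeed satisfy the condition $\sum_i\e^{-k^i}\leq 1$:
\beqn
  \sum_{i=1}^\infty \e^{-k^i}
  = \e^{-1/2}\sum_{i=1}^\infty {1\over i^2}
  = \e^{-1/2}\,{\pi^2\over 6}
  \approx 0.6065\cdot 1.6449 \approx 0.998 \leq 1.
\eeqn
For uniform complexities $k^i=\ln n$ the sum is $n\cdot{1\over n}=1$, so the condition holds with equality.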

333:

334: %-------------------------------%

335: \paradot{General decision spaces}

336: %-------------------------------%

337: The setup can be generalized as follows. Let $\S\subset\SetR^n$ be the

338: \emph{state space} and $\D\subset\SetR^n$ the \emph{decision

339: space}. At time $t$ the state is $s_t\in\S$, and a decision

340: $d_t\in\D$ (which is made before the state is revealed) incurs a

341: loss $d_t\!\scp s_t$, where ``$\scp$'' denotes the inner product. This

342: implies that the loss function is \emph{linear} in the states.

343: Conversely, each linear loss function can be represented in this

344: way. The decision which minimizes the loss in state $s\in\S$ is

345: \beq\label{Mdef}

346:   M(s):=\arg\min_{d\in\D} \{d\scp s\}

347: \eeq

348: if the minimum exists. The application of this general framework

349: to PEA is straightforward: $\D$ is identified with the space of

350: all unit vectors $\E=\{e_i:1\leq i\leq n\}$, since a decision

351: consists of selecting a single expert, and $s_t\in[0,1]^n$, so

352: states are identified with losses. Only Theorems~\ref{thIFPL} and

353: \ref{thLowFPL} will be stated in terms of a general decision space.

354: Our main focus is $\D=\E$. (Even for this special case, the scalar

355: product notation is not too heavy, but will turn out to be

356: convenient.) All our results generalize to the simplex

357: $\D=\Delta=\{v\in[0,1]^n:\sum_i v^i=1\}$, since the minimum of a

358: linear function on $\Delta$ is always attained on $\E$.
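
To illustrate the notation with a small example (the numbers are chosen arbitrarily and are not from the analysis): for $n=3$, $\D=\E$ and $s=(0.4,\,0.1,\,0.7)$ we have $e_i\scp s=s^i$, hence
\beqn
  M(s)=e_2, \qquad M(s)\scp s=\min_i s^i=0.1.
\eeqn
For any $v\in\Delta$, $v\scp s=\sum_i v^i s^i\geq\min_i s^i$, so enlarging $\D$ from $\E$ to $\Delta$ does not decrease the minimum.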

359:

360: %-------------------------------%

361: \paradot{Follow the Perturbed Leader}

362: %-------------------------------%

363: Given $s\ltt$ at time $t$, an immediate idea to solve the expert

364: problem is to ``Follow the Leader'' (FL), i.e.\ selecting the

365: expert $e_i$ which performed best in the past (minimizes

366: $s\ltt^i$), that is predict according to expert $M(s\ltt)$. This

367: approach fails for two reasons. First, for $n=\infty$ the minimum

368: in (\ref{Mdef}) may not exist. Second, for $n=2$ and

369: $s={\,0\,1\,0\,1\,0\,1 \ldots \choose \frac{1}{2}0\,1\,0\,1\,0

370: \ldots}$, FL always chooses the wrong prediction \citep{Kalai:03}.

371: We solve the first problem by penalizing each expert by its

372: complexity, i.e.\ predicting according to expert $M(s\ltt+k)$. The

373: \emph{FPL (Follow the Perturbed Leader)} approach solves the

374: second problem by adding to each expert's loss $s\ltt^i$ a random

375: perturbation.

376: %

377: We choose this perturbation to be negative \emph{exponentially

378: distributed}, either independent in each time step or once and for

379: all at the very beginning at time $t=0$. The former choice is

380: preferable in order to protect against an adaptive adversary who

381: generates the $s_t$, and in order to get bounds with high

382: probability (Section~\ref{secMisc}). For the main analysis

383: however, the latter choice is more convenient. Due to linearity of

384: expectations, these two possibilities are equivalent when dealing

385: with {\it expected losses} (this is straightforward for oblivious

386: adversary, for adaptive adversary see Section~\ref{secAdap}), so

387: we can henceforth assume without loss of generality one initial

388: perturbation $q$.
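
The failure of FL on the two-expert sequence above can be checked with a few lines of code. The following Python sketch is an illustration added here, not part of the original analysis; the function name \texttt{follow\_the\_leader}, the horizon $T=100$, and the tie-breaking rule (lowest index) are assumptions made for the example.

```python
def follow_the_leader(losses):
    """Return (total loss of FL, cumulative losses of all experts).

    losses[t][i] is the loss of expert i at step t.
    """
    n = len(losses[0])
    past = [0.0] * n                      # s_{<t}: cumulative past losses
    total = 0.0
    for s_t in losses:
        # FL picks the expert with the smallest past loss (ties: lowest index).
        leader = min(range(n), key=lambda i: past[i])
        total += s_t[leader]
        past = [p + s for p, s in zip(past, s_t)]
    return total, past


T = 100
expert1 = [float(t % 2) for t in range(T)]                    # 0 1 0 1 ...
expert2 = [0.5] + [float((t + 1) % 2) for t in range(1, T)]   # 1/2 0 1 0 ...
losses = list(zip(expert1, expert2))

fl_loss, final = follow_the_leader(losses)
print(fl_loss, min(final))   # 99.0 49.5
```

On this sequence FL suffers loss $1$ in every step after the first, i.e.\ a total of $T-1$, while each expert accumulates only about $T/2$.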

389:

390: %-------------------------------%

391: \paranodot{The FPL algorithm} is defined as follows:\\

392: %-------------------------------%

393: %

394: \hspace*{1cm}Choose random vector $q\stackrel{d.}{\sim}\exp$,

395:              i.e.\ $P[q^1...q^n]=\e^{-q^1}\cdot...\cdot\e^{-q^n}$ for $q\geq 0$.\\

396: \hspace*{1cm}For $t=1,...,T$\\

397: \hspace*{1cm}- Choose learning rate $\eta_t$.\\

398: \hspace*{1cm}- Output prediction of expert $i$ which minimizes $s_{<t}^i+(k^i-q^i)/\eta_t$.\\

399: \hspace*{1cm}- Receive loss $s_t^i$ for all experts $i$.

400:

401: \vspace{1.5ex}\noindent Other than $s\ltt$, $k$ and $q$, FPL

402: depends on the \emph{learning rate} $\eta_t$. We will give choices

403: for $\eta_t$ in Section~\ref{secBounds}, after having established

404: the main tools for the analysis. The expected loss at time $t$ of

405: FPL is $\ell_t:=E\big[M(s_{<t}+{k-q\over\eta_t})\scp s_t\big]$.

406: The key idea in the FPL analysis is the use of an intermediate

407: predictor \emph{IFPL} (for \emph{Implicit or Infeasible FPL}).

408: IFPL predicts according to $M(s\leqt+\smash{k-q\over\eta_t})$,

409: thus under the knowledge of $s_t$ (which is of course not

410: available in reality). By

411: $r_t:=E\big[M(s_{1:t}+\smash{k-q\over\eta_t})\scp s_t\big]$ we

412: denote the expected loss of IFPL at time $t$. The losses of IFPL

413: will be upper-bounded by BEH in Section~\ref{secIFPL} and

414: lower-bounded by FPL in Section~\ref{secFFPL}. Note that our

415: definition of the FPL algorithm deviates from that of

416: \citeauthor{Kalai:03}. It uses an exponentially distributed

417: perturbation similar to their FPL$^*$ but one-sided and a

418: non-stationary learning rate like Hannan's algorithm.
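
The FPL algorithm and its infeasible counterpart IFPL can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name \texttt{fpl\_total\_loss}, the random losses, and the learning rate $\eta_t=\sqrt{\ln n/t}$ are placeholders chosen for the example (the choices of $\eta_t$ analyzed in this paper are given in Section~\ref{secBounds}), and a single initial perturbation $q$ is used.

```python
import math
import random


def fpl_total_loss(losses, k, eta, q, infeasible=False):
    """Cumulative loss of FPL (or IFPL) for one fixed perturbation q.

    losses: list of T loss vectors s_t with entries in [0, 1]
    k:      complexities k^i >= 0
    eta:    function t -> eta_t > 0 (hypothetical; supplied by the caller)
    q:      perturbation vector
    """
    n = len(k)
    past = [0.0] * n                     # s_{<t}
    total = 0.0
    for t, s_t in enumerate(losses, start=1):
        # IFPL "peeks" at s_t and uses s_{1:t}; FPL only uses s_{<t}.
        base = [p + s for p, s in zip(past, s_t)] if infeasible else past
        eta_t = eta(t)
        i = min(range(n), key=lambda j: base[j] + (k[j] - q[j]) / eta_t)
        total += s_t[i]
        past = [p + s for p, s in zip(past, s_t)]
    return total


# Illustrative run with placeholder data (all values are assumptions).
rng = random.Random(0)
n, T = 4, 200
losses = [[rng.random() for _ in range(n)] for _ in range(T)]
k = [math.log(n)] * n                            # uniform complexities
q = [rng.expovariate(1.0) for _ in range(n)]     # P[q^i] = e^{-q^i}
eta = lambda t: math.sqrt(math.log(n) / t)       # placeholder learning rate
fpl = fpl_total_loss(losses, k, eta, q)
ifpl = fpl_total_loss(losses, k, eta, q, infeasible=True)
```

Averaging \texttt{fpl} and \texttt{ifpl} over many independent draws of $q$ would give Monte Carlo estimates of $\ell_{1:T}$ and $r_{1:T}$.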

419:

420: %-------------------------------%

421: \paradot{Notes}

422: %-------------------------------%

423: Observe that we have stated the FPL algorithm regardless of the

424: actual \emph{predictions} of the experts and possible

425: \emph{observations}; only the \emph{losses} are relevant.

426: %

427: Note also that an expert can implement a highly complicated strategy

428: depending on past outcomes, despite its trivializing

429: identification with a constant unit vector. The complex expert's

430: (and environment's) behavior is summarized and hidden in the state

431: vector $s_t=$Loss$(x_t,y_t^i)_{1\leq i\leq n}$.

432: %

433: Our results therefore apply to \emph{arbitrary prediction and

434: observation spaces $\Y$ and $\X$ and arbitrary bounded loss

435: functions}.

436: This is in contrast to the major part of PEA work

437: developed for binary alphabet and 0/1 or absolute loss only.

438: %

439: Finally note that the setup allows for losses generated by an

440: adversary who tries to maximize the regret of FPL and knows the

441: FPL algorithm and all experts' past predictions/losses. If the

442: adversary also has access to FPL's past decisions, then FPL must

443: use independent randomization at each time step in order to

444: achieve good regret bounds.

445:

446: %-------------------------------%

447: \paradot{Motivation of FPL}

448: %-------------------------------%

449: Let $d(s_{<t})$ be any predictor with decision based on $s_{<t}$.

450: The following identity is easy to show:

451: \beq\label{eqFId}

452:   \underbrace{\sum_{t=1}^T d(s_{<t})\scp s_t}_{\text{``FPL''}}

453:   \;\equiv\;

454:   \underbrace{_{\rule{0ex}{3.8ex}}d(s_{1:T})\scp s_{1:T}}_{\text{``BEH''}}

455:   + \overbrace{\underbrace{\sum_{t=1}^T [d(s_{<t})\!-\!d(s_{1:t})]\scp s_{<t}}_{\text{``IFPL}-\text{BEH''}}}^{\text{$\leq 0$ if $d\approx M$}}

456:   + \overbrace{\underbrace{\sum_{t=1}^T [d(s_{<t})\!-\!d(s_{1:t})]\scp s_t}_{\text{``FPL}-\text{IFPL''}}}^{\text{small if $d(\cdot)$ is continuous}}

457: \eeq

458: For a good bound of FPL in terms of BEH we need the first term on

459: the r.h.s.\ to be close to BEH and the last two terms to be small.

460: The first term is close to BEH if $d\approx M$. The second to last

461: term is even negative if $d=M$, hence small if $d\approx M$. The

462: last term is small if $d(s_{<t})\approx d(s_{1:t})$, which is the

463: case if $d(\cdot)$ is a sufficiently smooth function.

464: Randomization smooths the discontinuous function $M$: The

465: function $d(s):=E[M(s-q)]$, where $q\in\SetR^n$ is some random

466: perturbation, is a continuous function in $s$. If the mean and

467: variance of $q$ are small, then $d\approx M$; if the variance of

468: $q$ is large, then $d(s_{<t})\approx d(s_{1:t})$. An intermediate

469: variance makes the last two terms of (\ref{eqFId}) simultaneously

470: small enough, leading to excellent bounds for FPL.
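
One way to verify (\ref{eqFId}) is the following short calculation, a sketch that uses only $s_{1:t}=s_{<t}+s_t$ and $s_{<1}=0$. The last two sums combine to
\beqn
  \sum_{t=1}^T [d(s_{<t})-d(s_{1:t})]\scp s_{1:t}
  = \sum_{t=1}^T d(s_{<t})\scp s_t
  + \sum_{t=1}^T \big[d(s_{<t})\scp s_{<t}-d(s_{1:t})\scp s_{1:t}\big].
\eeqn
Since $s_{<t+1}=s_{1:t}$, the second sum on the right telescopes to $d(s_{<1})\scp s_{<1}-d(s_{1:T})\scp s_{1:T}=-d(s_{1:T})\scp s_{1:T}$, which cancels the ``BEH'' term in (\ref{eqFId}) and leaves $\sum_{t=1}^T d(s_{<t})\scp s_t$.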

471:

472: %-------------------------------%

473: \paradot{List of notation}\hfill\\

474: %-------------------------------%

475: $n\in\SetN\cup\{\infty\}$ ($n=\infty$ means countably infinite $\E$).\\

476: $x^i$ is $i$th component of vector $x\in\SetR^n$.\\

477: $\E:=\{e_i:1\leq i\leq n\}=$ set of unit vectors ($e_i^j=\delta_{ij}$).\\

478: $\Delta:=\{v\in[0,1]^n:\sum_i v^i=1\}$= simplex.\\

479: $s_t\in[0,1]^n$= environmental state/loss vector at time $t$.\\

480: $s_{1:t}:=s_1+...+s_t$= state/loss (similar for $\ell_t$ and $r_t$).\\

481: $s_{1:T}^{min}=\min_i\{s_{1:T}^i\}$= loss of Best Expert in Hindsight (BEH).\\

482: $s_{<t}:=s_1+...+s_{t-1}$= state/loss summary ($s_{<0}=0$).\\

483: $M(s):=\arg\min_{d\in\D}\{d\scp s\}$= best decision on $s$.\\

484: $T\in\SetN_0$= total time=step, $t\in\SetN$= current time=step.\\

485: $k^i\geq 0$= penalization = complexity of expert $i$.\\

486: $q\in\SetR^n$= random vector with independent exponentially distributed components.\\

487: $I_t:=\arg\min_{i\in\E}\{s_{<t}^i+{k^i-q^i\over\eta_t}\}$= randomized prediction of FPL.\\

488: $\ell_t:=E[M(s_{<t}+{k-q\over\eta_t})\scp s_t]$= expected loss at time $t$ of FPL (=$E[s_t^{I_t}]$ for $\D=\E$).\\

489: $r_t:=E[M(s_{1:t}+{k-q\over\eta_t})\scp s_t]$= expected loss at time $t$ of IFPL. \\

490: $u_t:=M(s_{<t}+{k-q\over\eta_t})\scp s_t$= actual loss at time $t$ of FPL (=$s_t^{I_t}$ for $\D=\E$).\\

491:

492: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

493: \section{IFPL bounded by Best Expert in Hindsight}\label{secExpMax}\label{secIFPL}

494: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

495:

496: In this section we provide tools for comparing the loss of IFPL

497: to the loss of the best expert in hindsight. The first result

498: bounds the expected error induced by the exponentially distributed

499: perturbation.

500:

501: \begin{lemma}[Maximum of Shifted Exponential Distributions]\label{lemExpMax}

502: Let $q^1,...,q^n$ be (not necessarily independent) exponentially

503: distributed random variables, i.e.\ $P[q^i]=\e^{-q^i}$ for

504: $q^i\geq 0$ and $1\leq i\leq n\leq\infty$, and $k^i\in\SetR$ be

505: real numbers with $u:=\sum_{i=1}^n\e^{-k^i}$. Then

506: \bqan

507:   P[\max_i\{q^i-k^i\}\geq a]

508:   &=& 1-\prod_{i=1}^n \max\{0,1\!-\!\e^{-a-k^i}\}

509:   \qmbox{if} q^1,...,q^n \;\mbox{are independent,}

510: \\

511:   P[\max_i\{q^i-k^i\}\geq a]

512:   &\leq& \min\{1,u\,\e^{-a}\},

513: \\

514:   E[\max_i\{q^i-k^i\}] &\leq& 1+\ln u.

515: \eqan

516: \end{lemma}

517:

518: \paradot{Proof} Using

519: \beqn

520:   P[q^i<a] = \max\{0,1\!-\!\e^{-a}\}\geq 1-\e^{-a}

521:   \qmbox{and}

522:   P[q^i\geq a] = \min\{1,\e^{-a}\}\leq \e^{-a},

523: \eeqn

524: valid for any $a\in\SetR$, the exact expression for $P[\max]$ in

525: Lemma~\ref{lemExpMax} follows from

526: \beqn

527:   P[\max_i\{q^i-k^i\}<a]

528:   = P[q^i-k^i<a\;\forall i]

529:   = \prod_{i=1}^n P[q^i<a+k^i]

530:   = \prod_{i=1}^n \max\{0,1\!-\!\e^{-a-k^i}\}

531: \eeqn

532: where the second equality follows from the independence of the

533: $q^i$. The bound on $P[\max]$ for any $a\in\SetR$ (including negative $a$) follows

534: from

535: \beqn

536:   P[\max_i\{q^i-k^i\}\geq a]

537:   = P[\exists i:q^i-k^i\geq a]

538:   \leq \sum_{i=1}^n P[q^i-k^i\geq a]

539:   \leq \sum_{i=1}^n \e^{-a-k^i} = u\!\cdot\!\e^{-a}

540: \eeqn

541: where the first inequality is the union bound.

542: Using $E[z]\leq E[\max\{0,z\}]=\int_0^\infty P[\max\{0,z\}\geq

543: y]dy = \int_0^\infty P[z\geq y]dy$ (valid for any real-valued

544: random variable $z$) for $z=\max_i\{q^i-k^i\}-\ln u$, this implies

545: \beqn

546:   E[\max_i\{q^i-k^i\}-\ln u]

547:   \leq \int_0^\infty P\big[\max_i\{q^i-k^i\}\geq y+\ln u \big]dy

548:   \leq \int_0^\infty \e^{-y} dy\ = \ 1,

549: \eeqn

550: which proves the bound on $E[\max]$.

551: \qed

552:

553: If $n$ is finite, a lower bound $E[\max_i q^i]\geq 0.57721+\ln n$

554: can be derived, showing that the upper bound on $E[\max]$ is quite

555: tight (at least) for $k^i=0$ $\forall i$.
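
For completeness, a sketch of where this lower bound comes from (it relies on standard facts about exponential order statistics that are not derived in the text above): for independent $q^i$ the maximum of $n$ exponentially distributed variables has expectation equal to the $n$th harmonic number,
\beqn
  E[\max_{1\leq i\leq n} q^i] = \sum_{i=1}^n {1\over i} \geq \gamma+\ln n,
  \qquad \gamma\approx 0.57721,
\eeqn
where $\gamma$ is the Euler-Mascheroni constant. For $k^i=0$ we have $u=n$, so the upper bound $1+\ln u=1+\ln n$ exceeds the true expectation by at most $1-\gamma\approx 0.42$.
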

556: %

557: The following bound generalizes \citep[Lem.3]{Kalai:03} to

558: arbitrary weights, establishing a relation between IFPL and

559: the best expert in hindsight.

560:

561: \begin{theorem}[IFPL bounded by BEH]\label{thIFPL}

562: Let $\D\subseteq\SetR^n$, $s_t\in\SetR^n$ for $1\leq t\leq T$

563: (both $\D$ and $s$ may even have negative components, but we assume that all

564: required extrema are attained), and $q,k\in\SetR^n$. If

565: $\eta_t>0$ is decreasing in $t$, then the loss of the infeasible FPL

566: knowing $s_t$ at time $t$ in advance (l.h.s.) can be bounded in

567: terms of the best predictor in hindsight (first term on r.h.s.) plus

568: additive corrections:

569: \beqn

570:   \sum_{t=1}^T M(s_{1:t}+{k\!-\!q\over\eta_t})\scp s_t

571:   \leq \min_{d\in\D}\{d\scp(s_{1:T}+{k\over\eta_T})\}

572:      + {1\over\eta_T}\max_{d\in\D}\{d\scp(q-k)\}

573:      - {1\over\eta_T} M(s_{1:T}+{k\over\eta_T})\scp q.

574: \eeqn

575: \end{theorem}

576:

577: Note that if $\D=\E$ (or $\D=\Delta$) and $s_t\geq 0$, then

578: all extrema in the theorem are attained almost surely. The

579: same holds for all subsequent extrema in the proof and

580: throughout the paper.

581:

582: \paradot{Proof} For notational convenience, let $\eta_0=\infty$ and

583: $\tilde s\leqt=s\leqt+\frac{k-q}{\eta_t}$. Consider the losses

584: $\tilde s_t=s_t+(k-q)\big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\big)$

585: for the moment. We first show by induction on $T$ that the infeasible

586: predictor $M(\tilde s_{1:t})$ has zero regret for any loss $\tilde

587: s$, i.e.\

588: \beq\label{eqnoregret}

589:   \sum_{t=1}^T M(\tilde s_{1:t})\scp \tilde s_t \leq M(\tilde s_{1:T})\scp \tilde s_{1:T}.

590: \eeq

591: For $T=1$ this is obvious. For the induction step from $T-1$ to $T$

592: we need to show

593: \beq\label{eq:noregret1}

594:   M(\tilde s_{1:T})\scp \tilde s_T \leq M(\tilde s_{1:T})\scp

595:   \tilde s_{1:T} - M(\tilde s_{<T})\scp \tilde s_{<T}.

596: \eeq

597: This follows from $\tilde s_{1:T}=\tilde s_{<T}+\tilde s_T$ and

598: $M(\tilde s_{1:T})\scp \tilde s_{<T} \geq M(\tilde s_{<T})\scp

599: \tilde s_{<T}$ by minimality of $M$.

600: Rearranging terms in (\ref{eqnoregret}), we obtain

601: \beq\label{eqifpl2}

602:   \sum_{t=1}^T M(\tilde s_{1:t})\scp s_t

603:   \ \leq\

604:   M(\tilde s_{1:T})\scp \tilde s_{1:T}- \sum_{t=1}^T M(\tilde s_{1:t})\scp

605:   (k-q)\Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)

606: \eeq

607: Moreover, by minimality of $M$,

608: \bqa

609: \label{eqifpl4}

610: M(\tilde s_{1:T})\scp \tilde s_{1:T} & \leq &

611: M\Big(s_{1:T}+\frac{k}{\eta_T}\Big)\scp

612: \Big(s_{1:T}+\frac{k-q}{\eta_T}\Big)\\

613: \nonumber

614: & = & \min_{d\in\D}\left\{d\scp(s_{1:T}+{k\over\eta_T})\right\}-

615: M\Big(s_{1:T}+\frac{k}{\eta_T}\Big)\scp

616: \frac{q}{\eta_T}

617: \eqa

618: holds. Using ${1\over\eta_t}-{1\over\eta_{t-1}}\geq 0$ and again

619: minimality of $M$, we have

620: \bqa\label{eqifpl3}

621: \sum_{t=1}^T

622: ({1\over\eta_t}-{1\over\eta_{t-1}})M(\tilde s_{1:t})\scp(q-k) &

623: \leq & \sum_{t=1}^T

624: ({1\over\eta_t}-{1\over\eta_{t-1}})M(k-q)\scp(q-k)\\

625: \nonumber

626: &  = & {1\over\eta_T}M(k-q)\scp(q-k)

627: = {1\over\eta_T}\max_{d\in\D}\{d\scp(q-k)\}

628: \eqa

629: Inserting (\ref{eqifpl4}) and (\ref{eqifpl3}) back into (\ref{eqifpl2})

630: we obtain the assertion.

631: \qed
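The last two steps in (\ref{eqifpl3}) combine a telescoping sum with a sign flip. As a brief sketch in the notation of the proof, using the convention $1/\eta_0=0$ implied by $\eta_0=\infty$:
\beqn
  \sum_{t=1}^T\Big({1\over\eta_t}-{1\over\eta_{t-1}}\Big)
  = {1\over\eta_T}-{1\over\eta_0} = {1\over\eta_T}
  \qmbox{and}
  M(k-q)\scp(q-k) = -\min_{d\in\D}\{d\scp(k-q)\} = \max_{d\in\D}\{d\scp(q-k)\}.
\eeqn
The first inequality in (\ref{eqifpl3}) holds term by term, because $M(k-q)$ minimizes $d\scp(k-q)$ and hence maximizes $d\scp(q-k)$ over $d\in\D$, while each weight ${1\over\eta_t}-{1\over\eta_{t-1}}$ is nonnegative.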

632:

633: Assuming $q$ random with $E[q^i]=1$ and taking the expectation in

634: Theorem~\ref{thIFPL}, the last term reduces to

635: $-{1\over\eta_T}\sum_{i=1}^n M(s_{1:T}+{k\over\eta_T})^i$.

636: If $\D\geq 0$, the term is negative and may be dropped. In case of

637: $\D=\E$ or $\Delta$, the last term is identical to

638: $-{1\over\eta_T}$ (since $\sum_i d^i=1$) and keeping it improves

639: the bound.

640: %

641: Furthermore, we need to evaluate the expectation of the second to

642: last term in Theorem~\ref{thIFPL}, namely

643: $E[\max_{d\in\D}\{d\scp(q-k)\}]$. For $\D=\E$ and $q$ being

644: exponentially distributed, using Lemma~\ref{lemExpMax}, the

645: expectation is bounded by $1+\ln u$. We hence get the following

646: bound:

647:

648: \begin{corollary}[IFPL bounded by BEH]\label{corIFPL}

649: For $\D=\E$ and $\sum_i \e^{-k^i}\leq 1$ and

650: $P[q^i]=\e^{-q^i}$ for $q\geq 0$ and decreasing $\eta_t>0$, the

651: expected loss of the infeasible FPL exceeds the loss of expert $i$

652: by at most $k^i/\eta_T$:

653: \beqn

654:   r_{1:T} \;\leq\; s_{1:T}^i + {1\over\eta_T}k^i  \quad\forall i.

655: \eeqn

656: \end{corollary}
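The corollary follows by taking expectations in Theorem~\ref{thIFPL}. The following sketch assumes that $u$ in Lemma~\ref{lemExpMax} denotes $\sum_i\e^{-k^i}$, so that $\ln u\leq 0$ under the stated condition. For $\D=\E$ the minimum over $d$ is a minimum over experts, hence
\beqn
  r_{1:T}
  \;\leq\; \min_i\Big\{s_{1:T}^i+{k^i\over\eta_T}\Big\}
     + {1\over\eta_T}(1+\ln u) - {1\over\eta_T}
  \;\leq\; s_{1:T}^i+{k^i\over\eta_T} \quad\forall i.
\eeqn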

657:

658: Theorem~\ref{thIFPL} can be generalized to expert

659: dependent factorizable $\eta_t\leadsto \eta_t^i=\eta_t\cdot\eta^i$

660: by scaling $k^i\leadsto k^i/\eta^i$ and $q^i\leadsto q^i/\eta^i$.

661: Using $E[\max_i\{{q^i-k^i\over\eta^i}\}]\leq

662: E[\max_i\{q^i-k^i\}]/\min_i\{\eta^i\}$, Corollary~\ref{corIFPL}

663: generalizes to

664: \beqn

665:     E[\sum_{t=1}^T M(s_{1:t}+{k-q\over\eta_t^i})\scp s_t]

666:     \;\leq\; s_{1:T}^i + {1\over\eta_T^i}k^i + {1\over\eta_T^{min}}

667:     \quad\forall i,

668: \eeqn

669: where $\eta_T^{min}:=\min_i\{\eta_T^i\}$.

670: For example, for $\eta_t^i=\sqrt{k^i/t}$

671: we get the desired bound $s_{1:T}^i+\sqrt{T\cdot(k^i+4)}$.

672: Unfortunately we were not able to generalize Theorem~\ref{thFIFPL}

673: to expert-dependent $\eta$, necessary for the final bound on FPL.

674: In Section~\ref{secHierarchy} we solve this problem by a hierarchy

675: of experts.

676:

677: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

678: \section{Feasible FPL bounded by Infeasible FPL}\label{secFFPL}

679: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

680:

681: This section establishes the relation between the FPL and IFPL

682: losses. Recall that $\ell_t=E\big[M(s_{<t}+{k-q\over\eta_t})\scp

683: s_t\big]$ is the expected loss of FPL at time $t$ and

684: $r_t=E\big[M(s_{1:t}+{k-q\over\eta_t})\scp s_t\big]$ is the

685: expected loss of IFPL at time $t$.

686:

687: \begin{theorem}[FPL bounded by IFPL]\label{thFIFPL}

688: For $\D=\E$ and $0\leq s_t^i\leq 1$ $\forall i$ and arbitrary

689: $s_{<t}$ and $P[q]=\e^{-\sum_i q^i}$ for $q\geq 0$, the expected

690: loss of the feasible FPL is at most a factor $\e^{\eta_t}>1$

691: larger than for the infeasible FPL:

692: \beqn

693:   \ell_t\leq \e^{\eta_t}r_t, \qmbox{which implies}

694:   \ell_{1:T}-r_{1:T}\leq \sum_{t=1}^T\eta_t \ell_t.

695: \eeqn

696: Furthermore, if $\eta_t\leq 1$, then also $\ell_t\leq

697: (1+\eta_t+\eta_t^2)r_t\leq (1+2\eta_t)r_t$.

698: \end{theorem}

699:

700: \paradot{Proof}

701: Let $s=s_{<t}+\sooe k$ be the past cumulative penalized state

702: vector, $q$ be a vector of independent, exponentially distributed random variables,

703: i.e.\ $P[q^i]=\e^{-q^i}$, and $\eta=\eta_t$.

704: Then

705: \beqn

706:   {P[q^j\geq \eta(s^j-m+1)]\over P[q^j\geq\eta(s^j-m)]}

707:   = \left\{%

708: \begin{array}{ccc}

709:   \e^{-\eta}        & \mbox{if} & s^j\geq m \\

710:   \e^{-\eta(s^j-m+1)} & \mbox{if} & m-1\leq s^j\leq m \\

711:   1                  & \mbox{if} & s^j\leq m-1 \\

712: \end{array}%

713: \right\} \geq \e^{-\eta}

714: \eeqn
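The three cases can be read off from the exponential tail. As a short sketch, with $m$ an arbitrary real number at this point (it is fixed below):
\beqn
  P[q^j\geq a]=\e^{-\max\{a,0\}} \quad\mbox{for all } a\in\SetR,
  \qmbox{so that}
  {P[q^j\geq \eta(s^j-m+1)]\over P[q^j\geq\eta(s^j-m)]}
  = \e^{-\eta(\max\{s^j-m+1,0\}-\max\{s^j-m,0\})}.
\eeqn
With $x=s^j-m$, the difference $\max\{x+1,0\}-\max\{x,0\}$ equals $1$ for $x\geq 0$, equals $x+1\in[0,1]$ for $-1\leq x\leq 0$, and equals $0$ for $x\leq -1$, which gives the lower bound $\e^{-\eta}$ in every case.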

715: We now define the random variables $I:=\arg\min_i\{s^i-\sooe q^i\}$ and

716: $J:=\arg\min_i\{s^i+s_t^i-\sooe q^i\}$, where $0\leq s_t^i\leq 1$

717: $\forall i$. Furthermore, for fixed vector $x\in\SetR^n$ and fixed

718: $j$ we define $m:=\min_{i\neq j}\{s^i-\sooe x^i\}\leq \min_{i\neq

719: j}\{s^i+s_t^i-\sooe x^i\}=:m'$.

720: With this notation and using the independence of $q^j$ from $q^i$

721: for all $i\neq j$, we get

722: \beqn

723:   P[I=j|q^i=x^i\,\forall i\neq j]

724:   \;=\; P[s^j-\sooe q^j\leq m|q^i=x^i\,\forall i\neq j]

725:   \;=\; P[q^j\geq\eta(s^j-m)]

726: \eeqn

727: \beqn

728:   \;\leq\; \e^\eta P[q^j\geq\eta(s^j-m+1)]

729:   \;\leq\; \e^\eta P[q^j\geq\eta(s^j+s_t^j-m')]

730: \eeqn

731: \beqn

732:   \;=\; \e^\eta P[s^j+s_t^j-\sooe q^j\leq m'|q^i=x^i\,\forall i\neq j]

733:   \;=\; \e^\eta P[J=j|q^i=x^i\,\forall i\neq j]

734: \eeqn

735: Since this bound holds under any condition $x$, it also holds

736: unconditionally, i.e.\ $P[I=j]\leq \e^\eta P[J=j]$. For

737: $\D=\E$ we have $s_t^I=M(s_{<t}+{k-q\over\eta})\scp s_t$ and

738: $s_t^J=M(s_{1:t}+{k-q\over\eta})\scp s_t$, which implies

739: \beqn

740:   \ell_t

741:   \;=\;E[s_t^I]

742:   \;=\; \sum_{j=1}^n s_t^j\!\cdot\!P[I=j]

743:   \;\leq\; \e^\eta \sum_{j=1}^n s_t^j\!\cdot\!P[J=j]

744:   \;=\; \e^\eta E[s_t^J]

745:   \;=\; \e^\eta r_t.

746: \eeqn

747: Finally, $\ell_t-r_t\leq\eta_t\ell_t$ follows from $r_t\geq

748: \e^{-\eta_t}\ell_t\geq (1-\eta_t)\ell_t$, and $\ell_t\leq

749: \e^{\eta_t}r_t\leq (1+\eta_t+\eta_t^2)r_t\leq (1+2\eta_t)r_t$ for

750: $\eta_t\leq 1$ is elementary.

751: \qed
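The elementary inequalities used in the last step can be checked as follows (a short sketch, for $0\leq\eta\leq 1$):
\beqn
  \e^\eta = 1+\eta+\eta^2\sum_{j\geq 2}{\eta^{j-2}\over j!}
  \;\leq\; 1+\eta+(\e-2)\eta^2
  \;\leq\; 1+\eta+\eta^2
  \;\leq\; 1+2\eta.
\eeqn
For instance, $\eta=0.1$ gives $\e^{0.1}\approx 1.105\leq 1.11\leq 1.2$. Summing $\ell_t-r_t\leq\eta_t\ell_t$ over $t=1,\ldots,T$ yields the cumulative statement of the theorem.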

752:

753: \paradot{Remark}

754: As done by \citet{Kalai:03}, one can prove a similar statement

755: for general decision space $\D$ as long as

756: $\sum_i|s_t^i|\leq A$ is guaranteed for some $A>0$: In this

757: case, we have $\ell_t\leq \e^{\eta_t A}r_t$. If $n$ is finite,

758: then the bound holds for $A=n$. For $n=\infty$, the assertion

759: holds under the somewhat unnatural assumption that $\S$ is

760: $l^1$-bounded.

761:

762: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

763: \section{\boldmath Combination of Bounds and Choices for $\eta_t$}\label{secBounds}

764: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

765:

766: Throughout this section, we assume

767: \beq

768: \label{eq:Assumptions}

769:   \D=\E,\quad s_t\in[0,1]^n\ \forall t,\quad

770:   P[q]=\e^{-\sum_i q^i} \;\mbox{for}\; q\geq 0,\ \qmbox{and}

771:   \sum_i \e^{-k^i}\leq 1.

772: \eeq

773: We distinguish \emph{static} and \emph{dynamic} bounds. Static

774: bounds refer to a constant $\eta_t\equiv\eta$. Since this value

775: has to be chosen in advance, a static choice of $\eta_t$ requires

776: certain prior information and therefore is not practical in many

777: cases. However, the static bounds are very easy to derive, and

778: they provide a good means to compare different PEA algorithms. If

779: on the other hand the algorithm shall be applied without

780: appropriate prior knowledge, a dynamic choice of $\eta_t$ depending

781: only on $t$ and/or past observations, is necessary.

782:

783: \begin{theorem}[FPL bound for static $\eta_t=\eta\propto 1/\sqrt{L}$]\label{thFPLStatic}

784: If (\ref{eq:Assumptions}) holds, then the expected loss

785: $\ell_t$ of feasible FPL, which employs the prediction of the

786: expert $i$ minimizing $s_{<t}^i+{k^i-q^i\over\eta_t}$, is bounded

787: by the loss of the best expert in hindsight in the following way:

788: \bqan

789:   i) & & \nq

790:   \mbox{For}\quad \eta_t=\eta=1/\sqrt{L}

791:   \qmbox{with} L\geq\ell_{1:T}

792:   \qmbox{we have}

793: \\

794:      & & \nq

795:   \ell_{1:T}

796:   \;\leq\; s_{1:T}^i + \sqrt{L}(k^i+1) \quad\forall i

797: \\

798:   ii) & & \nq

799:   \mbox{For}\quad \eta_t=\sqrt{K/L}

800:   \qmbox{with} L\geq\ell_{1:T}

801:   \qmbox{and} k^i\leq K \;\forall i

802:   \qmbox{we have}

803: \\

804:     & & \nq

805:   \ell_{1:T}

806:   \;\leq\; s_{1:T}^i + 2\sqrt{LK} \quad\forall i

807: \\

808:   iii) & & \nq

809:   \mbox{For}\quad \eta_t=\sqrt{k^i/L}

810:   \qmbox{with} L\geq \max\{s_{1:T}^i,k^i\}

811:   \qmbox{we have}

812: \\

813:     & & \nq

814:   \ell_{1:T}

815:   \;\leq\; s_{1:T}^i + 2\sqrt{Lk^i}+3k^i

816: \eqan

817: \end{theorem}

818:

819: Note that according to assertion $(iii)$, knowledge of only the

820: \emph{ratio} of the complexity and the loss of the best

821: expert is sufficient in order to obtain good static bounds, even

822: for non-uniform complexities.

823:

824: \paradot{Proof} $(i,ii)$ For $\eta_t=\sqrt{K/L}$ and $L\geq\ell_{1:T}$,

825: from Theorem~\ref{thFIFPL} and Corollary

826: \ref{corIFPL}, we get

827: \beqn

828:   \ell_{1:T}-r_{1:T}

829:   \leq \sum_{t=1}^T\eta_t\ell_t

830:   = \ell_{1:T}\sqrt{K/L}\leq\sqrt{LK}

831:   \qmbox{and}

832:   r_{1:T}-s_{1:T}^i

833:   \leq k^i/\eta_T=k^i\sqrt{L/K}

834: \eeqn

835: Combining both, we get

836: $\ell_{1:T}-s_{1:T}^i\leq\sqrt{L}(\sqrt{K}+k^i/\sqrt{K})$.

837: $(i)$ follows from $K=1$ and $(ii)$ from $k^i\leq K$.

838:

839: \noindent

840: $(iii)$ For $\eta=\sqrt{k^i/L}\leq 1$ we get

841: \bqan

842:   \ell_{1:T}

843:   & \leq & \e^\eta r_{1:T}

844:   \leq (1+\eta+\eta^2)r_{1:T}

845:   \leq (1+\sqrt{k^i\over L}+{k^i\over L})(s_{1:T}^i+\sqrt{L\over

846:   k^i}k^i)\\

847:   & \leq & s_{1:T}^i+\sqrt{Lk^i} +(\sqrt{k^i\over L}+{k^i\over L})(L+\sqrt{Lk^i})

848:   = s_{1:T}^i + 2\sqrt{Lk^i} +(2+\sqrt{k^i\over L})k^i

849: \eqan

850: \qed
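The last expression in the proof of $(iii)$ still contains the factor $2+\sqrt{k^i/L}$. The constant stated in the theorem follows from the assumption $L\geq\max\{s_{1:T}^i,k^i\}$, which is also what allows $s_{1:T}^i\leq L$ in the second line of the estimate. Spelled out,
\beqn
  \sqrt{k^i/L}\leq 1
  \quad\Longrightarrow\quad
  \big(2+\sqrt{k^i/L}\big)k^i \;\leq\; 3k^i.
\eeqn
As an illustrative calculation for $(ii)$ with hypothetical values: for $n=100$ experts of uniform complexity $K=\ln 100\approx 4.61$ and $L=T=10^4$, one obtains $\eta=\sqrt{K/L}\approx 0.021$ and a regret bound of $2\sqrt{LK}\approx 429$, i.e.\ about $0.043$ per round.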

851:

852: The static bounds require knowledge of an upper bound $L$ on the

853: loss (or the ratio of the complexity of the best expert and its

854: loss). Since the instantaneous loss is bounded by $1$, one may set

855: $L=T$ if $T$ is known in advance. For finite $n$ and $k^i=K=\ln

856: n$, bound $(ii)$ gives the classic regret $\propto\sqrt{T\ln

857: n}$. If neither $T$ nor $L$ is known, a dynamic choice of $\eta_t$

858: is necessary. We first present bounds with regret $\propto\sqrt{T}$,

859: thereafter with regret $\propto\sqrt{s_{1:T}^i}$.

860:

861: \begin{theorem}[FPL bound for dynamic $\eta_t\propto 1/\sqrt{t}$]\label{thFPLTDynamic}

862: Assume (\ref{eq:Assumptions}) holds.

863: \bqan

864:   i) & & \nq

865:   \mbox{For}\quad \eta_t=1/\sqrt{t}

866:   \qmbox{we have}

867:   \ell_{1:T} \;\leq\; s_{1:T}^i + \sqrt{T}(k^i+2) \quad\forall i

868: \\

869:   ii) & & \nq

870:   \mbox{For}\quad \eta_t=\sqrt{K/2t}

871:   \;\;\mbox{and}\;\; k^i\leq K \;\forall i

872:   \;\;\mbox{we have}\;\;

873:   \ell_{1:T} \;\leq\; s_{1:T}^i + 2\sqrt{2TK}

874:   \quad\forall i

875: \eqan

876: \end{theorem}

877:

878: \paradot{Proof} For $\eta_t=\sqrt{K/2t}$, using

879: $\sum_{t=1}^T{1\over\sqrt{t}}\leq\int_0^T{dt\over\sqrt{t}}=

880: 2\sqrt{T}$ and $\ell_t\leq 1$ we get

881: \beqn

882:   \ell_{1:T}-r_{1:T}

883:   \leq \sum_{t=1}^T \eta_t

884:   \leq \sqrt{2TK}

885:   \qmbox{and}

886:   r_{1:T}-s_{1:T}^i

887:   \leq {k^i/\eta_T}=k^i\sqrt{2T\over K}

888: \eeqn

889: Combining both, we get

890: $\ell_{1:T}-s_{1:T}^i \leq \sqrt{2T}(\sqrt{K}+k^i/\sqrt{K})$.

891: $(i)$ follows from $K=2$ and $(ii)$ from $k^i\leq K$.

892: \qed
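The first chain of inequalities in the proof combines Theorem~\ref{thFIFPL} and $\ell_t\leq 1$ with the integral bound. Written out for $\eta_t=\sqrt{K/2t}$,
\beqn
  \sum_{t=1}^T\eta_t\ell_t
  \;\leq\; \sqrt{K\over 2}\sum_{t=1}^T{1\over\sqrt t}
  \;\leq\; \sqrt{K\over 2}\cdot 2\sqrt{T}
  \;=\; \sqrt{2TK}.
\eeqn
Compared with the static bound $2\sqrt{TK}$ of Theorem~\ref{thFPLStatic}$(ii)$ for $L=T$, not knowing $T$ in advance costs a factor $\sqrt 2$. For the hypothetical values $n=100$, $K=\ln 100$ and $T=10^4$, the two bounds evaluate to roughly $429$ and $607$, respectively.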

893:

894: In Theorem~\ref{thFPLStatic} we assumed knowledge of an

895: upper bound $L$ on $\ell_{1:T}$. In an adaptive form,

896: $L_t:=\ell_{<t}+1$, known at the beginning of time $t$, could be used

897: as an upper bound on $\ell_{1:t}$ with corresponding adaptive

898: $\eta_t\propto 1/\sqrt{L_t}$. Such a choice of $\eta_t$ is also

899: called \emph{self-confident} \citep{Auer:02pea}.

900:

901: \begin{theorem}[FPL bound for self-confident $\eta_t\propto 1/\sqrt{\ell_{<t}}$]\label{thFPLLDynamic}

902: Assume (\ref{eq:Assumptions}) holds.

903: \bqan

904:   i) & & \nq

905:   \mbox{For}\quad \eta_t=1/\sqrt{2(\ell_{<t}+1)}

906:   \qmbox{we have}

907: \\

908:    & & \nq

909:   \ell_{1:T}

910:   \;\leq\; s_{1:T}^i + (k^i\!+\!1)\sqrt{2(s_{1:T}^i\!+\!1)} + 2(k^i\!+\!1)^2

911:   \quad\forall i

912: \\

913:   ii) & & \nq

914:   \mbox{For}\quad \eta_t=\sqrt{K/2(\ell_{<t}+1)}

915:   \qmbox{and} k^i\leq K \;\forall i

916:   \qmbox{we have}

917: \\

918:     & & \nq

919:   \ell_{1:T}

920:   \;\leq\; s_{1:T}^i + 2\sqrt{2(s_{1:T}^i\!+\!1)K} + 8K

921:   \quad\forall i

922: \eqan

923: \end{theorem}

924:

925: \paradot{Proof} Using

926: $\eta_t=\sqrt{K/2(\ell_{<t}+1)}\leq\sqrt{K/2\ell_{1:t}}$ and

927: ${b-a\over\sqrt

928: b}=(\sqrt{b}-\sqrt{a})(\sqrt{b}+\sqrt{a}){1\over\sqrt{b}}\leq

929: 2(\sqrt{b}-\sqrt{a})$ for $a\leq b$ and $t_0:=\min\{t:\ell_{1:t}>0\}$ we get

930: \beqn\label{eqLD}

931:   \ell_{1:T}\!-\!r_{1:T}

932:   \leq \sum_{t=t_0}^T \eta_t\ell_t

933:   \leq \sqrt{K\over 2}\sum_{t=t_0}^T {\ell_{1:t}\!-\!\ell_{<t}\over\sqrt{\ell_{1:t}}}

934:   \leq \sqrt{2K}\sum_{t=t_0}^T [\sqrt{\ell_{1:t\!\!}}\,-\!\sqrt{\ell_{<t\!\!}}\;]

935:   = \sqrt{2K}\sqrt{\ell_{1:T}}

936: \eeqn

937: Adding

938: $r_{1:T}-s_{1:T}^i \leq {k^i\over\eta_T} \leq

939: k^i\sqrt{2(\ell_{1:T}+1)/K}$ we get

940: \beqn

941:   \ell_{1:T}-s_{1:T}^i

942:   \leq \sqrt{2\bar\kappa^i(\ell_{1:T}\!+\!1)},

943:   \qmbox{where}

944:   \sqrt{\bar\kappa^i}:=\sqrt{K}+k^i/\sqrt{K}.

945: \eeqn

946: Taking the square and solving the resulting quadratic inequality

947: w.r.t.\ $\ell_{1:T}$ we get

948: \beqn

949:   \ell_{1:T}

950:   \leq s_{1:T}^i + \bar\kappa^i + \sqrt{2(s_{1:T}^i\!+\!1)\bar\kappa^i+(\bar\kappa^i)^2}

951:   \leq s_{1:T}^i + \sqrt{2(s_{1:T}^i\!+\!1)\bar\kappa^i} + 2\bar\kappa^i

952: \eeqn

953: For $K=1$ we get $\sqrt{\bar\kappa^i}=k^i+1$ which yields $(i)$.

954: For $k^i\leq K$ we get $\bar\kappa^i\leq 4K$ which yields $(ii)$.

955: \qed
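The quadratic inequality in the proof can be solved explicitly. As a sketch, write $x:=\ell_{1:T}-s_{1:T}^i$ and $\kappa:=\bar\kappa^i$ (both abbreviations are introduced here only for illustration) and assume $x\geq 0$, since otherwise the bound is trivial:
\beqn
  x^2\leq 2\kappa(x+s_{1:T}^i+1)
  \quad\Longleftrightarrow\quad
  (x-\kappa)^2\leq \kappa^2+2\kappa(s_{1:T}^i+1)
  \quad\Longrightarrow\quad
  x\leq \kappa+\sqrt{\kappa^2+2\kappa(s_{1:T}^i+1)}.
\eeqn
The final form of the bound then uses $\sqrt{a+b}\leq\sqrt a+\sqrt b$ for $a,b\geq 0$.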

956:

957: The proofs of results similar to $(ii)$ for WM for 0/1 loss

958: all fill several pages \citep{Auer:02pea,Yaroshinsky:04}. The

959: next result establishes a similar bound, but instead of using

960: the \emph{expected} value $\ell\ltt$, the \emph{best loss so

961: far}

962: $s\ltt\smin$ is used. This may have computational advantages,

963: since $s\ltt\smin$ is immediately available, while $\ell\ltt$

964: needs to be evaluated (see discussion in Section~\ref{secMisc}).

965:

966: \begin{theorem}[FPL bound for adaptive $\eta_t\propto 1/\sqrt{s\ltt\smin}$]\label{thFPL2}

967: Assume (\ref{eq:Assumptions}) holds.

968: \bqan

969:   i) & & \nq

970:   \mbox{For}\quad \eta_t = 1/\min_i\{k^i+\sqrt{(k^i)^2+2s^i\ltt+2}\}

971:   \qmbox{we have}

972: \\

973:    & & \nq

974:   \ell\leqT \;\leq\; s\leqT^i+(k^i\!+2)\sqrt{2s\leqT^i}+2(k^i\!+2)^2

975:   \quad \forall i

976: \\

977:   ii) & & \nq

978:   \mbox{For}\quad \eta_t =

979:   \sqrt{\odt}\!\cdot\!\min\{1,\sqrt{K/s\ltt\smin}\}

980:   \qmbox{and} k^i\leq K \;\forall i

981:   \qmbox{we have}

982: \\

983:     & & \nq

984:   \ell\leqT \;\leq\;

985:   s\leqT^i+2\sqrt{2K s\leqT^i}+5K\ln(s\leqT^i)+3K+6

986:   \quad \forall i

987: \eqan

988: \end{theorem}

989: %

990: We briefly motivate the strange-looking choice for $\eta_t$ in

991: $(i)$. The first naive candidate, $\eta_t\propto 1/\sqrt{s\ltt^{min}}$,

992: turns out to be too large. The next natural trial is requesting

993: $\eta_t=1/\sqrt{2\min\{s\ltt^i+\frac{k^i}{\eta_t}\}}$. Solving

994: this equation results in $\eta_t=1/(k^i+\sqrt{(k^i)^2+2s\ltt^i})$,

995: where $i$ is the index for which $s\ltt^i+\frac{k^i}{\eta_t}$ is

996: minimal.
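
To see how this closed form arises, one can substitute $x:=1/\eta_t$ (an abbreviation used only here) for the minimizing index $i$:
\beqn
  x^2 = 2s\ltt^i+2k^i x
  \quad\Longrightarrow\quad
  x = k^i+\sqrt{(k^i)^2+2s\ltt^i},
\eeqn
where the nonnegative root of the quadratic is taken because $\eta_t>0$. The choice in $(i)$ differs from this by the additional $+2$ under the square root, which in the proof appears to absorb the current loss $s_t^i\leq 1$.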

997:

998: \paradot{Proof}

999: Define the minimum of a vector as its minimum component, e.g.\

1000: $\min(k)=k\smin$.

1001: For notational convenience, let

1002: $\eta_0=\infty$ and $\tilde s\leqt=s\leqt+\frac{k-q}{\eta_t}$.

1003: As in the proof of Theorem~\ref{thIFPL}, we consider one

1004: exponentially distributed perturbation $q$. Since $M(\tilde

1005: s\leqt)\scp \tilde s_t \leq M(\tilde s\leqt)\scp \tilde s\leqt-

1006: M(\tilde s\ltt)\scp \tilde s\ltt$ by (\ref{eq:noregret1}), we have

1007: \beqn

1008: M(\tilde s\leqt)\scp s_t\leq M(\tilde s\leqt)\scp

1009: \tilde s\leqt- M(\tilde s\ltt)\scp \tilde s\ltt -

1010: M(\tilde s\leqt)\scp

1011: \left(\frac{k-q}{\eta_{t}}-\frac{k-q}{\eta_{t-1}}\right)

1012: \eeqn

1013: Since $\eta_t\leq\sqrt{^1\!/_2}$, Theorem~\ref{thFIFPL} asserts

1014: $\ell_t\leq E[(1+\eta_t+\eta_t^2)M(\tilde s\leqt)\scp s_t]$, thus

1015: $\ell\leqT\leq A+B$, where

1016: \bqan

1017: A & = & \sum_{t=1}^T E\left[(1+\eta_t+\eta_t^2)(M(\tilde

1018: s\leqt)\scp

1019: \tilde s\leqt- M(\tilde s\ltt)\scp \tilde s\ltt)\right]\\

1020: & = &

1021: E[(1+\eta_T+\eta_T^2)M(\tilde s\leqT) \scp\tilde s\leqT]

1022: - E[(1+\eta_1+\eta_1^2)\min(\frac{k-q}{\eta_1})]\\

1023: && + \sum_{t=1}^{T-1}E\left[

1024: (\eta_t-\eta_{t+1}+\eta_t^2-\eta_{t+1}^2)M(\tilde

1025: s\leqt)\scp\tilde s\leqt\right]

1026: \mbox{\quad and}\\

1027: B & = & \sum_{t=1}^T E\left[(1+\eta_t+\eta_t^2) M(\tilde

1028: s\leqt)\scp

1029: \left(\frac{q-k}{\eta_{t}}-\frac{q-k}{\eta_{t-1}}\right)\right]\\

1030: & \leq & \sum_{t=1}^T (1+\eta_t+\eta_t^2)

1031: \left(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}\right)

1032: =\frac{1+\eta_T+\eta_T^2}{\eta_T}+

1033: \sum_{t=1}^{T-1}\frac{\eta_t-\eta_{t+1}+\eta_t^2-\eta_{t+1}^2}{\eta_t}

1034: \eqan

1035: Here, the estimate for $B$ follows from

1036: $\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\geq 0$ and

1037: $E [M(\eta_t s\leqt+k-q)\scp(q-k)]\leq E[\max_i\{q^i-k^i\}]\leq 1$, which

1038: in turn holds by minimality of $M$, $\sum_i \e^{-k^i}\leq 1$ and

1039: Lemma~\ref{lemExpMax}. In order to estimate $A$, we set

1040: $\bar s\leqt=s\leqt+\frac{k}{\eta_t}$ and observe

1041: $M(\tilde s\leqt)\scp\tilde s\leqt\leq

1042: M(\bar s\leqt)\scp(\bar s\leqt-\frac{q}{\eta_t})$ by minimality

1043: of $M$. The expectations of $q$ can then be evaluated to

1044: $E[M(\bar s\leqt)\scp q]=1$, and as before we have $E[-\min(k-q)]\leq 1$.

1045: Hence

1046: \bqa

1047: \nonumber

1048: \ell\leqT & \leq & A+B\ \leq\

1049: (1+\eta_T+\eta_T^2)\left(M(\bar s\leqT)\scp\bar

1050: s\leqT-\frac{1}{\eta_T}\right)

1051: + \frac{1+\eta_1+\eta_1^2}{\eta_1}\\

1052: \label{eq:basicest}

1053: && + \sum_{t=1}^{T-1}

1054: (\eta_t-\eta_{t+1}+\eta_t^2-\eta_{t+1}^2)\left(M(\bar

1055: s\leqt)\scp\bar s\leqt-\frac{1}{\eta_t}\right)+B\\

1056: \nonumber

1057: & \leq &

1058: (1+\eta_T+\eta_T^2)\min(\bar s\leqT)+

1059: \sum_{t=1}^{T-1} (\eta_t-\eta_{t+1}+\eta_t^2-\eta_{t+1}^2)

1060: \min(\bar s\leqt)+\frac{1}{\eta_1}+2.

1061: \eqa

1062: We now proceed by considering the two parts of the theorem

1063: separately.

1064:

1065: $(i)$

1066: Here,

1067: $\eta_t=1/\min(k+\sqrt{k^2+2s\ltt+2})$. Fix $t\leq T$ and

1068: choose $m$ such that

1069: $k^m+\sqrt{(k^m)^2+2s\ltt^m+2}$ is minimal. Then

1070: \beqn \min(s\leqt+\frac{k}{\eta_t})

1071: \leq s\ltt^m+1+\frac{k^m}{\eta_t}

1072: =

1073: \mbox{$\frac{1}{2}$}\big(k^m+\sqrt{(k^m)^2+2s\ltt^m+2}\big)^2=\frac{1}{2\eta_t^2}

1074: \leq\frac{1}{2\eta_t\eta_{t+1}}.

1075: \eeqn

1076: We may overestimate the quadratic terms $\eta_t^2$ in

1077: (\ref{eq:basicest}) by $\eta_t$ -- the easiest justification

1078: is that we could have started with the cruder estimate

1079: $\ell_t\leq(1+2\eta_t)r_t$ from Theorem~\ref{thFIFPL}. Then

1080: \bqan

1081: \ell\leqT & \leq &

1082: (1+2\eta_T)\min(s\leqT+\frac{k}{\eta_T})+ 2\sum_{t=1}^{T-1}

1083: (\eta_t-\eta_{t+1})\min(s\leqt+\frac{k}{\eta_t})+\frac{1}{\eta_1}+2\\

1084: & \leq &

1085: (1+2\eta_T)\frac{1}{2\eta_T^2}+ 2\sum_{t=1}^{T-1}

1086: (\eta_t-\eta_{t+1})\frac{1}{2\eta_t^2}+\frac{1}{\eta_1}+2\\

1087: & \leq & \frac{1}{2\eta_T^2}+\frac{1}{\eta_T}+

1088: \sum_{t=1}^{T-1}\left(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_t}\right)+\frac{1}{\eta_1}+2\\

1089: & \leq &

1090: \mbox{$\frac{1}{2}$}\min(k+\sqrt{k^2+2s\ltT+2})^2+2\min(k+\sqrt{k^2+2s\ltT+2}) +2\\

1091: & \leq &

1092: s\leqT^i+(k^i+2)\sqrt{2s\leqT^i}+2(k^i)^2+6k^i+6

1093: \quad\mbox{for all}\ i.

1094: \eqan

1095: This proves the first part of the theorem.
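
The last inequality in the chain above compresses several elementary estimates. One possible way to verify it, writing $x^i:=k^i+\sqrt{(k^i)^2+2s\ltT^i+2}$ (notation introduced only for this sketch) and using $s\ltT^i\leq s\leqT^i$, $k^i\geq 0$ and $\sqrt{a+b}\leq\sqrt a+\sqrt b$, is
\bqan
  \mbox{$\frac{1}{2}$}(x^i)^2+2x^i+2
  & = & s\ltT^i+(k^i)^2+2k^i+3+(k^i+2)\sqrt{(k^i)^2+2s\ltT^i+2} \\
  & \leq & s\leqT^i+(k^i)^2+2k^i+3+(k^i+2)\big(\sqrt{2s\leqT^i}+k^i+\sqrt 2\big) \\
  & = & s\leqT^i+(k^i+2)\sqrt{2s\leqT^i}+2(k^i)^2+(4+\sqrt 2)k^i+3+2\sqrt 2.
\eqan
The minimum over all experts is at most the value at any fixed $i$. Since $4+\sqrt 2\leq 6$ and $3+2\sqrt 2\leq 6$, the constant part is at most $2(k^i)^2+6k^i+6$, which in turn is bounded by the $2(k^i+2)^2=2(k^i)^2+8k^i+8$ stated in the theorem.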

1096:

1097: $(ii)$ Here we have $K\geq k^i$ for all $i$. Abbreviate

1098: $a_t=\max\{K,s\leqt\smin\}$ for $1\leq t\leq T$, then

1099: $\eta_t=\sqrt{\frac{K}{2a_{t-1}}}$,

1100: $a_t\geq K$, and $a_t-a_{t-1}\leq 1$ for all $t$. Observe

1101: $M(\bar s\leqt)=M(s\leqt)$,

1102: $\eta_t-\eta_{t+1}=\frac{\sqrt K(a_t-a_{t-1})}

1103: {\sqrt 2\sqrt{a_t}\sqrt{a_{t-1}} (\sqrt{a_t}+\sqrt{a_{t-1}})}$,

1104: $\eta_t^2-\eta_{t+1}^2=\frac{K(a_t-a_{t-1})}{2a_ta_{t-1}}$, and

1105: $\frac{a_t-a_{t-1}} {2 a_{t-1}}\leq

1106: \ln(1+\frac{a_t-a_{t-1}}{a_{t-1}})=\ln (a_t)-\ln (a_{t-1})$ which is true for

1107: $\frac{a_t-a_{t-1}}{a_{t-1}}\leq\frac{1}{K}\leq\frac{1}{\ln 2}$. This

1108: implies

1109: \bqan

1110: \frac{(\eta_t-\eta_{t+1})K}{\eta_t} & \leq &

1111: \frac{K(a_t-a_{t-1})}

1112: {2 a_{t-1}}\leq K\ln\left(1+\frac{a_t-a_{t-1}}{a_{t-1}}\right)

1113: = K\big(\ln(a_t)-\ln(a_{t-1})\big), \\

1114: (\eta_t-\eta_{t+1})s\leqt\smin & \leq &

1115: \frac{\sqrt K(a_t-a_{t-1})(\sqrt{a_{t-1}}+

1116: \sqrt{a_t}-\sqrt{a_{t-1}})}

1117: {\sqrt 2\sqrt{a_{t-1}} (\sqrt{a_t}+\sqrt{a_{t-1}})}\\

1118: & = & \sqrt{\frac{K}{2}}(\sqrt{a_t}-\sqrt{a_{t-1}})

1119: +\frac{\sqrt K(a_t-a_{t-1})^2}{\sqrt{2a_{t-1}}(\sqrt{a_t}+\sqrt{a_{t-1}})^2}\\

1120: & \leqs

1121: {\hspace*{-5cm}\rlap{\fbox{$\stackrel{\mbox{\tiny use

1122: }\scriptstyle a_t-a_{t-1}\leq 1} {\mbox{\tiny and }

1123: \scriptstyle a_{t-1}\geq K}$}}}

1124: & \sqrt{\frac{K}{2}}(\sqrt{a_t}-\sqrt{a_{t-1}})+

1125: \frac{1}{2\sqrt 2}\big(\ln(a_t)-\ln(a_{t-1})\big),\\

1126: \frac{(\eta_t^2-\eta_{t+1}^2)K}{\eta_t} & = &

1127: \frac{K\sqrt{K}(a_t-a_{t-1})}{\sqrt{2}a_t\sqrt{a_{t-1}}}

1128: \leqs {\fbox{$\scriptstyle a_{t-1}\geq K$}}

1129: \sqrt{2}K\big(\ln(a_t)-\ln(a_{t-1})\big), \mbox{ and}\\

1130: (\eta_t^2-\eta_{t+1}^2)s\leqt\smin & \leq &

1131: \frac{K(a_t-a_{t-1})}{2a_{t-1}} \leq

1132: K\big(\ln(a_t)-\ln(a_{t-1})\big),

1133: \eqan

1134: The logarithmic estimate in the second and

1135: third bound is unnecessarily rough and for convenience only.

1136: Therefore, the coefficient of the log-term in the final bound of

1137: the theorem can be reduced to

1138: $2K$ without much effort. Plugging the above estimates back into

1139: (\ref{eq:basicest}) yields

1140: \bqan

1141: \ell\leqT & \leq & s\leqT\smin+\sqrt{\frac{K}{2} s\leqT\smin}+\sqrt{2K s\leqT\smin}+3K+2

1142: +\sqrt{\frac{K}{2} s\leqT\smin}+

1143: \big(\mbox{$\frac{7}{2}$}K+\mbox{$\frac{1}{2\sqrt 2}$}\big)\ln(s\leqT\smin)\\

1144: &&+\frac{1}{\eta_1}+2

1145: \leq s\leqT\smin+2\sqrt{2K

1146: s\leqT\smin}+5K\ln(s\leqT\smin)+3K+6.

1147: \eqan

1148: This completes the proof.

1149: \qed

1150:

1151: Theorem~\ref{thFPLLDynamic} and Theorem~\ref{thFPL2} $(i)$

1152: immediately imply the following bounds on the

1153: $\sqrt{\mbox{Loss}}$-regrets:

1154: $\sqrt{\ell_{1:T}}\leq\sqrt{s_{1:T}^i+1}+\sqrt{8K}$,

1155: $\sqrt{\ell_{1:T}}\leq\sqrt{s_{1:T}^i+1}+\sqrt{2}(k^i+1)$, and

1156: $\sqrt{\ell_{1:T}}\leq\sqrt{s_{1:T}^i}+\sqrt{2}(k^i+2)$,

1157: respectively.
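
Each of these follows by completing a square. For example, for Theorem~\ref{thFPLLDynamic}$(ii)$,
\beqn
  \ell_{1:T}
  \;\leq\; s_{1:T}^i+2\sqrt{2(s_{1:T}^i+1)K}+8K
  \;\leq\; (s_{1:T}^i+1)+2\sqrt{(s_{1:T}^i+1)\cdot 8K}+8K
  \;=\; \big(\sqrt{s_{1:T}^i+1}+\sqrt{8K}\big)^2,
\eeqn
and taking square roots gives the first bound. The other two are obtained in the same way.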

1158:

1159: \paradot{Remark}

1160: The same analysis as for Theorems

1161: [\ref{thFPLStatic}--\ref{thFPL2}]$(ii)$ applies to general $\D$,

1162: using $\ell_t\leq \e^{\eta_t n}r_t$ instead of $\ell_t\leq

1163: \e^{\eta_t}r_t$, and leading to an additional factor $\sqrt{n}$ in

1164: the regret. Compare the remark at the end of Section~\ref{secFFPL}.

1165:

1166: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1167: \section{Hierarchy of Experts}\label{secHierarchy}

1168: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1169:

1170: We derived bounds which do not need prior knowledge of $L$ with

1171: regret $\propto\sqrt{TK}$ and $\propto\sqrt{s_{1:T}^i K}$ for a

1172: finite number of experts with equal penalty $K=k^i=\ln n$. For

1173: an infinite number of experts, unbounded expert-dependent complexity

1174: penalties $k^i$ are necessary (due to constraint $\sum_i

1175: \e^{-k^i}\leq 1$). Bounds for this case (without prior knowledge of

1176: $T$) with regret $\propto k^i\sqrt{T}$ and $\propto

1177: k^i\sqrt{s_{1:T}^i}$ have been derived. In this case, the

1178: complexity $k^i$ is no longer under the square root. Although

1179: this already implies Hannan consistency, i.e.\ the average per-round

1180: regret tends to zero as $T\to\infty$, improved regret

1181: bounds $\propto\sqrt{Tk^i}$ and $\propto\sqrt{s_{1:T}^i k^i}$

1182: are desirable and likely to hold. We were not able to derive

1183: such improved bounds for FPL itself, but only for a (slight) modification.

1184: We consider a two-level hierarchy of experts. First consider

1185: an FPL for the subclass of experts of complexity

1186: $K$, for each $K\in\SetN$. Regard these FPL$^K$ as (meta) experts

1187: and use them to form a (meta) FPL. The class of meta experts now

1188: contains for each complexity only one (meta) expert, which allows

1189: us to derive good bounds. In the following, quantities referring

1190: to complexity class $K$ are superscripted by $K$, and meta

1191: quantities are superscripted by $\;\widetilde{}$ .

1192:

1193: Consider the class of experts $\E^K:=\{i:K-1<k^i\leq K\}$ of

1194: complexity $K$, for each $K\in\SetN$. FPL$^K$ makes randomized

1195: prediction

1196: $I_t^K:=\arg\min_{i\in\E^K}\{s_{<t}^i+\smash{k^i-q^i\over\eta_t^K}\}$

1197: with $\eta_t^K:=\sqrt{K/2t}$ and suffers loss $u_t^K:=s_t^{I_t^K}$

1198: at time $t$. Since $k^i\leq K$ $\forall i\in\E^K$ we can apply

1199: Theorem~\ref{thFPLTDynamic}$(ii)$ to FPL$^K$:

1200: \beq\label{eqFH}

1201:   E[u_{1:T}^K] \;=\; \ell_{1:T}^K \;\leq\; s_{1:T}^i+ 2\sqrt{2TK}

1202:   \quad \forall i\in\E^K

1203:   \quad \forall K\in\SetN.

1204: \eeq

1205: We now define a meta state $\tilde s_t^K=u_t^K$ and regard FPL$^K$

1206: for $K\in\SetN$ as meta experts, so meta expert $K$ suffers loss

1207: $\tilde s_t^K$. (Assigning expected loss $\tilde

1208: s_t^K=E[u_t^K]=\ell_t^K$ to FPL$^K$ would also work.) Hence the

1209: setting is again an expert setting and we define the meta

1210: $\widetilde{\mbox{FPL}}$ to predict $\tilde

1211: I_t:=\arg\min_{K\in\SetN}\{\tilde s_{<t}^K+{\tilde k^K-\tilde

1212: q^K\over \tilde\eta_t}\}$ with $\tilde\eta_t=1/\sqrt{t}$ and

1213: $\tilde k^K=\odt+2\ln K$ (implying $\sum_{K=1}^\infty \e^{-\tilde

1214: k^K}\leq 1$). Note that $\tilde s_{1:t}^K=\tilde s_1^K+...+\tilde

1215: s_t^K= s_1^{I_1^K}+...+s_t^{I_t^K}$ sums over the same meta state

1216: components $K$, but over different components ${I_t^K}$ in normal

1217: state representation.
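
The summability condition on the meta complexities $\tilde k^K={1\over 2}+2\ln K$ can be verified directly. As a quick check, using $\sum_{K\geq 1}K^{-2}=\pi^2/6$,
\beqn
  \sum_{K=1}^\infty \e^{-\tilde k^K}
  \;=\; \e^{-1/2}\sum_{K=1}^\infty {1\over K^2}
  \;=\; {\pi^2\over 6\sqrt{\e}}
  \;\approx\; 0.998 \;\leq\; 1.
\eeqn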

1218:

1219: By Theorem~\ref{thFPLTDynamic}$(i)$ the $\tilde q$-expected loss

1220: of $\widetilde{\mbox{FPL}}$ is bounded by $\tilde s_{1:T}^K +

1221: \sqrt{T}(\tilde k^K+2)$. As this bound holds for all $q$ it also holds

1222: in $q$-expectation. So if we define $\tilde\ell_{1:T}$ to be the

1223: $q$ {\em and} $\tilde q$ expected loss of

1224: $\widetilde{\mbox{FPL}}$, and chain this bound with (\ref{eqFH})

1225: for $i\in\E^K$ we get:

1226: \bqan

1227:   \tilde\ell_{1:T}

1228:   &\leq& E[\tilde s_{1:T}^K + \sqrt{T}(\tilde k^K\!+2)]

1229:    \;=\; \ell_{1:T}^K + \sqrt{T}(\tilde k^K\!+2) \\

1230:   &\leq& s_{1:T}^i+ \sqrt{T}[2\sqrt{2(k^i\!+1)}+\odt+2\ln (k^i\!+1)+2],

1231: \eqan

1232: where we have used $K\leq k^i+1$. This bound is valid for all $i$

1233: and has the desired regret $\propto\sqrt{T k^i}$. Similarly we can

1234: derive regret bounds $\propto\sqrt{s_{1:T}^i k^i}$ by exploiting

1235: that the bounds in Theorems~\ref{thFPLLDynamic} and \ref{thFPL2}

1236: are concave in $s_{1:T}^i$ and using Jensen's inequality.

1237:

1238: \begin{theorem}[Hierarchical FPL bound for dynamic $\eta_t$]\label{thHFPL}

1239: The hierarchical $\widetilde{\mbox{FPL}}$ employs at time $t$

1240: the prediction of expert $i_t:=I_t^{\tilde I_t}$, where

1241: \vspace{-0.5ex}\beqn

1242:   I_t^K:=\mathop{\arg\min}_{i:\lceil k^i\rceil=K}

1243:     \Big\{s_{<t}^i+{\textstyle{k^i-q^i\over\eta_t^K}}\Big\}

1244:   \qmbox{and}

1245:   \tilde I_t:=\mathop{\arg\min}_{K\in\SetN}

1246:     \Big\{s_1^{I_1^K}+...+s_{t-1}^{I_{t-1}^K}+

1247:     {\textstyle{{1\over 2}+2\ln\!K -\tilde q^K\over \tilde\eta_t}}\Big\}

1248:   \vspace{-1.5ex}

1249: \eeqn

1250: Under assumptions (\ref{eq:Assumptions}) and independent $P[\tilde q^K]=\e^{-\tilde

1251: q^K}$ $\forall K\in\SetN$, the

1252: expected loss $\tilde\ell_{1:T}=E[s_1^{i_1}+...+s_T^{i_T}]$ of

1253: $\widetilde{\mbox{FPL}}$ is bounded as follows:

1254: \bqan

1255:   a) & & \nq

1256:   \mbox{For}\quad \eta_t^K=\sqrt{K/2t}

1257:   \qmbox{and} \tilde\eta_t=1/\sqrt{t}

1258:   \qmbox{we have}

1259: \\

1260:    & & \nq

1261:   \tilde\ell_{1:T}

1262:   \;\leq\; s_{1:T}^i + 2\sqrt{2Tk^i}\!\cdot\!\big(1+O({\textstyle{\ln k^i\over \sqrt{k^i}}})\big)

1263:   \quad\forall i.

1264: \\

1265:   b) & & \nq

1266:   \mbox{For $\tilde\eta_t$ as in $(i)$ and $\eta_t^K$ as in $(ii)$

1267:   of Theorem $\{{\ref{thFPLLDynamic}\atop\ref{thFPL2}}\}$ we have}

1268: \\

1269:     & & \nq

1270:   \tilde\ell_{1:T}

1271:   \;\leq\; s_{1:T}^i + 2\sqrt{2s_{1:T}^i k^i}\!\cdot\!\big(1+O({\textstyle{\ln k^i\over \sqrt{k^i}}})\big)

1272:   + {\textstyle\big\{{O(k^i)\atop O(k^i\ln s_{1:T}^i)}\big\}}

1273:   \quad\forall i.

1274: \eqan

1275: \end{theorem}
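The form of bound $(a)$ can be recovered from the explicit estimate derived before the theorem. As a sketch for large $k^i$, using $\sqrt{k^i+1}\leq\sqrt{k^i}+1/(2\sqrt{k^i})$:
\beqn
  2\sqrt{2(k^i+1)}+\mbox{${5\over 2}$}+2\ln(k^i+1)
  \;\leq\; 2\sqrt{2k^i}\Big(1+{1\over 2k^i}+{5/2+2\ln(k^i+1)\over 2\sqrt{2k^i}}\Big)
  \;=\; 2\sqrt{2k^i}\Big(1+O\big({\textstyle{\ln k^i\over\sqrt{k^i}}}\big)\Big).
\eeqn
Multiplying by $\sqrt T$ gives the regret term in $(a)$.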

1276: %

1277: The hierarchical $\widetilde{\mbox{FPL}}$ differs from a

1278: direct FPL over all experts $\E$. One potential way to prove a

1279: bound on direct FPL may be to show (if it holds) that FPL

1280: performs better than $\widetilde{\mbox{FPL}}$, i.e.\ $\ell_{1:T}\leq

1281: \tilde\ell_{1:T}$. Another way may be to suitably generalize

1282: Theorem~\ref{thFIFPL} to expert dependent $\eta$.

1283:

1284: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1285: \section{Lower Bound on FPL}\label{secLowFPL}

1286: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1287:

1288: A lower bound on FPL similar to the upper bound in Theorem

1289: \ref{thIFPL} can also be proven.

1290:

1291: \begin{theorem}[FPL lower-bounded by BEH]\label{thLowFPL}

1292: Let $n$ be finite. Assume $\D\subseteq\SetR^n$ and $s_t\in\SetR^n$

1293: are chosen such that the required extrema exist (possibly

1294: negative), $q\in\SetR^n$, and $\eta_t>0$ is a

1295: decreasing sequence. Then the loss of FPL for uniform

1296: complexities (l.h.s.) can be lower-bounded in terms of the best

1297: predictor in hindsight (first term on r.h.s.) plus/minus additive

1298: corrections:

1299: \beqn

1300:   \sum_{t=1}^T M(s_{<t}-{q\over\eta_t})\scp s_t

1301:   \geq \min_{d\in\D}\{d\scp s_{1:T}\}

1302:      - {1\over\eta_T}\max_{d\in\D}\{d\scp q\}

1303:      + \sum_{t=1}^T ({1\over\eta_t}\!-\!{1\over\eta_{t-1}}) M(s_{<t})\scp q

1304: \eeqn

1305: \end{theorem}

1306:

1307: \paradot{Proof}

1308: For notational convenience, let $\eta_0=\infty$ and

1309: $\tilde s\leqt=s\leqt-\frac{q}{\eta_t}$. Consider the losses

1310: $\tilde s_t=s_t-q\big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\big)$

1311: for the moment. We first show by induction on $T$ that the

1312: predictor $M(\tilde s_{<t})$ has nonnegative regret, i.e.\

1313: \beq\label{eqposregret}

1314:   \sum_{t=1}^T M(\tilde s_{<t})\scp\tilde s_t \geq M(\tilde s_{1:T})\scp\tilde s_{1:T}.

1315: \eeq

1316: For $T=1$ this follows immediately from minimality of $M$

1317: ($\tilde s_{<1}:=0$). For the induction step from $T-1$ to $T$ we

1318: need to show

1319: \beqn

1320:   M(\tilde s_{<T})\scp \tilde s_T \geq M(\tilde s_{1:T})\scp \tilde s_{1:T} -

1321:   M(\tilde s_{<T})\scp \tilde s_{<T}.

1322: \eeqn

1323: Due to $\tilde s_{1:T}=\tilde s_{<T}+\tilde s_T$, this is

1324: equivalent to $M(\tilde s_{<T})\scp \tilde s_{1:T} \geq M(\tilde

1325: s_{1:T})\scp \tilde s_{1:T}$, which holds by minimality of

1326: $M$. Rearranging terms in (\ref{eqposregret}) we obtain

1327: \beq\label{eqifpl2l}

1328:   \sum_{t=1}^T M(\tilde s_{<t})\scp s_t

1329:   \geq M(\tilde s_{1:T})\scp \tilde s\leqT

1330:    + \sum_{t=1}^T M(\tilde s_{<t})\scp q

1331:    \Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Big), \quad\mbox{with}

1332: \eeq

1333: \beqn

1334:   M(\tilde s_{1:T})\scp \tilde s\leqT=

1335:   M(s_{1:T}-\frac{q}{\eta_T})\scp s_{1:T}

1336:   -M(s_{1:T}-\frac{q}{\eta_T})\scp \frac{q}{\eta_T}

1337:   \geq \min_{d\in\D}\{d\scp s_{1:T}\}

1338:   - {1\over\eta_T}\max_{d\in\D}\{d\scp q\}

1339: \eeqn

1340: \beqn

1341: \mbox{and}\quad \sum_{t=1}^T M(\tilde s_{<t})\scp q

1342: \Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)

1343: \;\geq\; \sum_{t=1}^T \Big({1\over\eta_t}-{1\over\eta_{t-1}}\Big)M(s_{<t})\scp q

1344: \eeqn

1345: Again, the last bound follows from the minimality of $M$, which

1346: asserts that $[M(s-q)-M(s)]\scp s\geq 0\geq

1347: [M(s-q)-M(s)]\scp(s-q)$ and thus implies that $M(s-q)\scp q\geq

1348: M(s)\scp q$. So Theorem \ref{thLowFPL} follows from (\ref{eqifpl2l}).

1349: \qed
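
The last step of the proof can be spelled out as follows. This is only a restatement of the argument above; the scaling constant $c>0$ is an auxiliary symbol introduced here. Subtracting the right inequality from the left one gives
\beqn
  [M(s-q)-M(s)]\scp s \;-\; [M(s-q)-M(s)]\scp(s-q)
  \;=\; [M(s-q)-M(s)]\scp q \;\geq\; 0 .
\eeqn
Applying this with $q$ replaced by $cq$ and dividing by $c$ yields $M(s-cq)\scp q\geq M(s)\scp q$ for all $c>0$. With $s=s_{<t}$ and $c=1/\eta_{t-1}$ this reads $M(\tilde s_{<t})\scp q\geq M(s_{<t})\scp q$ for $t\geq 2$, and multiplying by ${1\over\eta_t}-{1\over\eta_{t-1}}\geq 0$ (recall that $\eta_t$ is decreasing) gives the termwise bound. For $t=1$ both sides coincide, since $\tilde s_{<1}=0$ and $s_{<1}=0$ (empty sum).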

1350:

1351: Assuming $q$ random with $E[q^i]=1$ and taking the expectation in

1352: Theorem~\ref{thLowFPL}, the last term reduces to

1353: $\sum_t({1\over\eta_t}-{1\over\eta_{t-1}})\sum_i

1354: M(s_{<t})^i$.

1355: If $\D\geq 0$, the term is nonnegative and may be dropped. In case of

1356: $\D=\E$ or $\Delta$, the last term is identical to

1357: ${1\over\eta_T}$ (since $\sum_i d^i=1$) and keeping it improves

1358: the bound.

1359: %

1360: Furthermore, we need to evaluate the expectation of the second to

1361: last term in Theorem~\ref{thLowFPL}, namely

1362: $E[\max_{d\in\D}\{d\scp q\}]$. For $\D=\E$ and $q$ being

1363: exponentially distributed, using Lemma~\ref{lemExpMax} with

1364: $k^i=0$ $\forall i$, the expectation is bounded by $1+\ln n$.

1365: We hence get the following lower bound:

1366:

1367: \begin{corollary}[FPL lower-bounded by BEH]\label{corLowFPL}

1368: For $\D=\E$ and any $\S$ and all $k^i$ equal and

1369: $P[q^i]=\e^{-q^i}$ for $q\geq 0$ and decreasing $\eta_t>0$, the

1370: expected loss of FPL is at most

1371: $\ln n/\eta_T$ lower than the loss of the best expert in hindsight:

1372: \beqn

1373:   \ell_{1:T} \;\geq\; s_{1:T}^{min} - {\ln n\over\eta_T}

1374: \eeqn

1375: \end{corollary}
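
The arithmetic behind Corollary~\ref{corLowFPL} can be traced in a few lines (a sketch that only combines the statements made above). For $\D=\E$ every decision satisfies $\sum_i d^i=1$, so with $E[q^i]=1$ and $\eta_0=\infty$ the last term of Theorem~\ref{thLowFPL} telescopes in expectation:
\beqn
  \sum_{t=1}^T \Big({1\over\eta_t}-{1\over\eta_{t-1}}\Big)\sum_i M(s_{<t})^i
  \;=\; \sum_{t=1}^T \Big({1\over\eta_t}-{1\over\eta_{t-1}}\Big)
  \;=\; {1\over\eta_T}-{1\over\eta_0} \;=\; {1\over\eta_T}.
\eeqn
Together with $E[\max_{d\in\E}\{d\scp q\}]\leq 1+\ln n$ this gives
\beqn
  \ell_{1:T} \;\geq\; s_{1:T}^{min} - {1+\ln n\over\eta_T} + {1\over\eta_T}
  \;=\; s_{1:T}^{min} - {\ln n\over\eta_T}.
\eeqn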

1376:

1377: The upper and lower bounds on $\ell_{1:T}$

1378: (Theorem~\ref{thFIFPL} and Corollaries~\ref{corIFPL} and

1379: \ref{corLowFPL}) together show that

1380: \beq\label{eqltos}

1381:   {\ell_{1:t}\over s_{1:t}^{min}} \to 1

1382:   \quad\qmbox{if}\quad

1383:   \eta_t\to 0

1384:   \qmbox{and}

1385:   \eta_t\!\cdot\!s_{1:t}^{min} \to \infty

1386:   \qmbox{and}

1387:   k^i=K\;\forall i

1388: \eeq

1389: For instance, $\eta_t=\sqrt{K/2 s_{<t}^{min}}$. For

1390: $\eta_t=\sqrt{K/2(\ell_{<t}+1)}$ we proved the bound in Theorem

1391: \ref{thFPLLDynamic}$(ii)$. Knowing that $\sqrt{K/2(\ell_{<t}+1)}$

1392: converges to $\sqrt{K/2 s_{<t}^{min}}$ due to (\ref{eqltos}), we

1393: can derive a bound similar to Theorem~\ref{thFPLLDynamic}$(ii)$

1394: for $\eta_t=\sqrt{K/2 s_{<t}^{min}}$. This choice for $\eta_t$ has

1395: the advantage that we do not have to compute $\ell_{<t}$ (cf.\

1396: Section~\ref{secComp}), as also achieved by Theorem~\ref{thFPL2}$(ii)$.

1397:

1398: We do not know whether Theorem~\ref{thLowFPL} can be

1399: generalized to expert dependent complexities $k^i$.

1400:

1401: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1402: \section{Adaptive Adversary}\label{secAdap}

1403: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1404:

1405: In this section we show that bounds that hold against an

1406: oblivious adversary automatically also hold against an

1407: adaptive one.

1408:

1409: %-------------------------------%

1410: \paradot{Initial versus independent randomization}

1411: %-------------------------------%

1412: So far we assumed that the perturbations $q$ are sampled only once at

1413: time $t=0$. As already indicated, under the expectation this is

1414: equivalent to generating a new perturbation $q_t$ at each time

1415: step $t$, i.e.\ Theorems \ref{thFIFPL}--\ref{thHFPL} remain valid

1416: for this case. While the former choice was favorable for the

1417: analysis, the latter has two advantages.

1418: %

1419: First, repeated sampling of the perturbations guarantees better

1420: bounds with high probability (see next section).

1421: %

1422: Second, if the losses are generated by an adaptive adversary (not

1423: to be confused with an adaptive learning rate) which has access to

1424: FPL's past decisions, then he may after some time figure out the

1425: initial random perturbation and use it to force FPL to have a

1426: large loss.

1427: %

1428: We now show that the bounds for FPL remain valid, even in case of

1429: an adaptive adversary, if independent randomization $q\leadsto

1430: q_t$ is used.

1431:

1432: %-------------------------------%

1433: \paradot{Oblivious versus adaptive adversary}

1434: %-------------------------------%

1435: Recall the protocol for FPL: After each expert $i$ made its

1436: prediction $y_t^i$, and FPL combined them to form its own prediction

1437: $y_t^{\FPL}$, we observe $x_t$, and Loss($x_t,y_t^{\cdots}$) is

1438: revealed for FPL's and each expert's prediction. For independent

1439: randomization, we have $y_t^{\FPL}=y_t^{\FPL}(x_{<t},y_{1:t},q_t)$. For an

1440: oblivious (non-adaptive) adversary, $x_t=x_t(x_{<t},y_{<t})$.

1441: Recursively inserting and eliminating the experts

1442: $y_t^i=y_t^i(x_{<t},y_{<t})$ and $y_t^{\FPL}$, we get the dependencies

1443: \beq\label{eqnAdapDep}

1444:   u_t :=\mbox{Loss}(x_t,y_t^{\FPL}) = u_t(x_{1:t},q_t)

1445:   \qmbox{and}

1446:   s_t^i := \mbox{Loss}(x_t,y_t^i) = s_t^i(x_{1:t}),

1447: \eeq

1448: where $x_{1:t}$ is a ``fixed'' sequence.

1449: With this notation, Theorems \ref{thFPLStatic}--\ref{thFPL2} read

1450: $\ell_{1:T}\equiv E[\sum_{t=1}^T u_t(x_{1:t},q_t)]\leq f(x_{1:T})$

1451: for all $x_{1:T}\in\X^T$, where $f(x_{1:T})$ is one of the

1452: r.h.s.\ in Theorems \ref{thFPLStatic}--\ref{thFPL2}. Noting that

1453: $f$ is independent of $q_{1:T}$, we can write this as

1454: \beq\label{eqnAdapBnd}\label{defAt}

1455:   A_1\leq 0, \qmbox{where}

1456:   A_t(x_{<t},q_{<t})

1457:   := \max_{x_{t:T}}E_{q_{t:T}}\Big[\sum_{\tau=1}^T u_\tau(x_{1:\tau},q_\tau)-f(x_{1:T})\Big],

1458: \eeq

1459: where $E_{q_{t:T}}$ is the expectation w.r.t.\ $q_t...q_T$

1460: (keeping $q_{<t}$ fixed).

1461:

1462: For an adaptive adversary, $x_t=x_t(x_{<t},y_{<t},y_{<t}^{\FPL})$

1463: can additionally depend on $y_{<t}^{\FPL}$. Eliminating $y_t^i$

1464: and $y_t^{\FPL}$ we get, again, (\ref{eqnAdapDep}), but

1465: $x_t=x_t(x_{<t},q_{<t})$ is no longer fixed, but an (arbitrary)

1466: random function. So we have to replace $x_t$ by

1467: $x_t(x_{<t},q_{<t})$ in (\ref{eqnAdapBnd}) for $t=1..T$. The

1468: maximization is now a functional maximization over all functions

1469: $x_t(\cdot,\cdot)...x_T(\cdot,\cdot)$. Using ``$\max_{x(\cdot)}E_q

1470: [g(x(q),q)]=E_q\max_x[g(x,q)]$,$\!$'' we can write this as

1471: \beq\label{defBt}

1472:   B_1\stackrel?\leq 0, \qmbox{where}

1473:   B_t(x_{<t},q_{<t})

1474:   := \max_{x_t}E_{q_t}...\max_{x_T}E_{q_T}\Big[\sum_{\tau=1}^T u_\tau(x_{1:\tau},q_\tau)-f(x_{1:T})\Big],

1475: \eeq

1476: So, establishing $B_1\leq 0$ would show that all bounds

1477: also hold in the adaptive case.

1478:

1479: \begin{lemma}[Adaptive=Oblivious]\label{lemAdap}

1480: Let $q_1...q_T\in\SetR^T$ be independent random variables,

1481: $E_{q_t}$ be the expectation w.r.t.\ $q_t$, $f$ any function of

1482: $x_{1:T}\in\X^T$, and $u_t$ arbitrary functions of $x_{1:t}$ and $q_t$.

1483: Then, $A_t(x_{<t},q_{<t})=B_t(x_{<t},q_{<t})$ for all $1\leq t\leq T$, where

1484: $A_t$ and $B_t$ are defined in (\ref{defAt}) and (\ref{defBt}).

1485: In particular, $A_1\leq 0$ implies $B_1\leq 0$.

1486: \end{lemma}

1487: %

1488: \paradot{Proof} We prove $B_t=A_t$ by induction on $t$, which

1489: establishes the lemma. $B_T=A_T$ is obvious. Assume $B_t=A_t$.

1490: Then

1491: \bqan

1492:   B_{t-1} &=& \max_{x_{t-1}}E_{q_{t-1}}B_t \;=\; \max_{x_{t-1}}E_{q_{t-1}} A_t

1493: \\

1494:   &=& \max_{x_{t-1}}E_{q_{t-1}}

1495:       \bigg[\max_{x_{t:T}}E_{q_{t:T}}\Big[\sum_{\tau=1}^T u_\tau(x_{1:\tau},q_\tau)-f(x_{1:T})\Big]\bigg]

1496: \\

1497:   &=& \max_{x_{t-1}}E_{q_{t-1}}

1498:       \bigg[\underbrace{\sum_{\tau=1}^{t-1} u_\tau(x_{1:\tau},q_\tau)}_{\hspace*{-3ex}\text{independent of } x_{t:T} \text{ and } q_{t:T}} +

1499:             \underbrace{\max_{x_{t:T}}E_{q_{t:T}}\Big[\sum_{\tau=t}^T u_\tau(x_{1:\tau},q_\tau)-f(x_{1:T})\Big]}_{\text{independent of $q_{t-1}$, since the $q_t$ are independent}} \bigg]

1500: \\

1501:   &=&

1502:   \max_{x_{t-1}} \bigg[\overbrace{E_{q_{t-1}}\Big[\sum_{\tau=1}^{t-1} u_\tau(x_{1:\tau},q_\tau)\Big]} +

1503:   \overbrace{\max_{x_{t:T}}E_{q_{t:T}}\Big[\sum_{\tau=t}^T u_\tau(x_{1:\tau},q_\tau)-f(x_{1:T})\Big]}\bigg]

1504: \\

1505:   &=&

1506:   \max_{x_{t-1}}\max_{x_{t:T}}E_{q_{t:T}} \bigg[E_{q_{t-1}}\sum_{\tau=1}^{t-1} u_\tau(x_{1:\tau},q_\tau) +

1507:   \sum_{\tau=t}^T u_\tau(x_{1:\tau},q_\tau)-f(x_{1:T})\bigg]

1508:   \;\;=\;\; A_{t-1}

1509: \eqan\qed
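
To illustrate the induction step, consider the smallest nontrivial case $T=2$ (a worked instance of the computation above, not an additional result). Since $u_1$ does not depend on $x_2$ or $q_2$,
\beqn
  B_1 \;=\; \max_{x_1}E_{q_1}\Big[u_1(x_1,q_1)
      + \max_{x_2}\big\{E_{q_2}[u_2(x_{1:2},q_2)]-f(x_{1:2})\big\}\Big].
\eeqn
The inner maximum depends on $x_1$ only, because $q_2$ is independent of $q_1$, so $E_{q_1}$ acts on $u_1$ alone and
\beqn
  B_1 \;=\; \max_{x_1}\max_{x_2}\Big\{E_{q_1}[u_1(x_1,q_1)]
      + E_{q_2}[u_2(x_{1:2},q_2)]-f(x_{1:2})\Big\} \;=\; A_1.
\eeqn
If $u_2$ could depend on $q_1$ as well, as with a single initial perturbation $q_1=q_2=q$, the inner maximum would depend on $q_1$, and in general only $B_1\geq A_1$ could be concluded.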

1510:

1511: \begin{corollary}[FPL Bounds for adaptive adversary]\label{corAdap}

1512: Theorems \ref{thFPLStatic}--\ref{thFPL2} also hold for an adaptive

1513: adversary in case of independent randomization $q\leadsto q_t$.

1514: \end{corollary}

1515:

1516: Lemma \ref{lemAdap} shows that every bound of the form

1517: $A_1\leq 0$ proven for an oblivious adversary implies an

1518: analogous bound

1519: $B_1\leq 0$ for an adaptive adversary. Note that this strong

1520: statement holds only for the \emph{full observation game},

1521: i.e.\ if after each time step we learn all losses. In partial

1522: observation games such as the Bandit case \citep{Auer:95}, our

1523: actual action may depend on our past action by means of our

1524: past observation, and the assertion no longer holds. In this

1525: case, FPL with an adaptive adversary can be analyzed as shown

1526: by \citet{McMahan:04,Poland:05actexp}.

1527: %

1528: Finally, $y_t^{\IFPL}$ can additionally depend on $x_t$, but the

1529: ``reduced'' dependencies (\ref{eqnAdapDep}) are the same as for

1530: FPL; hence IFPL bounds also hold for an adaptive adversary.

1531:

1532: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1533: \section{Miscellaneous}\label{secMisc}\label{secComp}

1534: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1535:

1536: %-------------------------------%

1537: \paradot{Bounds with high probability}

1538: %-------------------------------%

1539: We have derived several bounds for the expected loss $\ell_{1:T}$

1540: of FPL. The {\em actual} loss at time $t$ is

1541: $u_t=M(s_{<t}+{k-q\over\eta_t})\scp s_t$. A simple Markov inequality shows

1542: that the total actual loss $u_{1:T}$ exceeds

1543: the total expected loss $\ell_{1:T}=E[u_{1:T}]$ by a factor of

1544: $c>1$ with probability at most $1/c$:

1545: \beqn

1546:   P[u_{1:T}\geq c\!\cdot\!\ell_{1:T}]

1547:   \;\leq\; {1/c}.

1548: \eeqn

1549: Randomizing independently for each $t$ as described in the

1550: previous section, the actual loss is

1551: $u_t=M(s_{<t}+{k-q_t\over\eta_t})\scp s_t$ with the same expected loss

1552: $\ell_{1:T}=E[u_{1:T}]$ as before. The advantage of independent

1553: randomization is that we can get a much better

1554: high-probability bound. We can exploit a Chernoff-Hoeffding

1555: bound \citep[Cor.5.2b]{McDiarmid:89}, valid for arbitrary

1556: independent random variables $0\leq u_t\leq 1$ for

1557: $t=1,...,T$:

1558: \beqn

1559:   P\Big[|u_{1:T}-E[u_{1:T}]|\geq\delta E[u_{1:T}]\Big]

1560:   \;\leq\; 2\exp(-{\textstyle{1\over 3}}\delta^2 E[u_{1:T}]), \qquad 0\leq\delta\leq 1.

1561: \eeqn

1562: For $\delta=\sqrt{3c/\ell_{1:T}}$ we get

1563: \beq\label{eqCH}

1564:   P[|u_{1:T}-\ell_{1:T}|\geq\sqrt{3c\ell_{1:T}}]

1565:   \;\leq\; 2\e^{-c}

1566:   \qmbox{as soon as}

1567:   \ell_{1:T}\geq 3c.

1568: \eeq

1569: Using (\ref{eqCH}), the bounds for $\ell_{1:T}$ of Theorems

1570: \ref{thFPLStatic}--\ref{thFPL2} can be rewritten to yield

1571: similar bounds with high probability ($1-2\e^{-c}$) for $u_{1:T}$

1572: with small extra regret $\propto\sqrt{c\cdot L}$ or $\propto\sqrt{c\cdot

1573: s_{1:T}^i}$.

1574: %

1575: Furthermore, (\ref{eqCH}) shows that with probability 1,

1576: $u_{1:T}/\ell_{1:T}$ converges rapidly to 1 for

1577: $\ell_{1:T}\to\infty$. Hence we may use the easier to compute

1578: $\eta_t=\sqrt{K/2u_{<t}}$ instead of

1579: $\eta_t=\sqrt{K/2(\ell_{<t}+1)}$, likely with similar bounds on the

1580: regret.
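
A numeric instance may help to compare the two bounds; the values $c=5$ and $\ell_{1:T}=1500$ are chosen here purely for illustration. The condition $\ell_{1:T}\geq 3c=15$ holds, $\sqrt{3c\ell_{1:T}}=\sqrt{22500}=150$, and (\ref{eqCH}) gives
\beqn
  P[|u_{1:T}-1500|\geq 150] \;\leq\; 2\e^{-5} \;\approx\; 0.0135 .
\eeqn
So with independent randomization the actual loss stays within $10\%$ of the expected loss with probability at least about $0.986$, whereas Markov's inequality with $c=1.1$ only yields $P[u_{1:T}\geq 1650]\leq 1/1.1\approx 0.91$ for the same upward deviation.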

1581:

1582: %-------------------------------%

1583: \paradot{Computational Aspects}

1584: %-------------------------------%

1585: It is easy to generate the randomized decision of FPL. Indeed,

1586: only a single initial exponentially distributed vector

1587: $q\in\SetR^n$ is needed. Only for self-confident $\eta_t\propto

1588: 1/\sqrt{\ell_{<t}}$ (see Theorem~\ref{thFPLLDynamic}) do we need to

1589: compute expectations explicitly. Given $\eta_t$, from $t\leadsto

1590: t+1$ we need to compute $\ell_t$ in order to update $\eta_t$. Note

1591: that $\ell_t=w_t\!\scp s_t$, where $w_t^i=P[I_t=i]$ and

1592: $I_t:=\arg\min_{i\in\E}\{s_{<t}^i+{k^i-q^i\over\eta_t}\}$ is the

1593: actual (randomized) prediction of FPL. With $s:=s_{<t}+k/\eta_t$,

1594: $P[I_t=i]$ has the following representation:

1595: \bqan

1596:   P[I_t=i]

1597:   &=& P[s^i-{q^i\over\eta_t}\leq s^j-{q^j\over\eta_t} \;\forall j\neq i] \\

1598:   &=& \int P[s^i-{q^i\over\eta_t}=m \;\wedge\; s^j-{q^j\over\eta_t}\geq m \;\forall j\neq i]dm \\

1599:   &=& \int P[q^i=\eta_t(s^i-m)]\cdot\prod_{j\neq i}P[q^j\leq \eta_t(s^j-m)]dm \\[-1ex]

1600:   &=& \int_{-\infty}^{s^{min}} \eta_t \e^{-\eta_t(s^i-m)}

1601:       \prod_{j\neq i}(1-\e^{-\eta_t(s^j-m)})dm \\

1602:   &=& \sum_{{\cal M}:\{i\}\subseteq{\cal M}\subseteq{\cal N}}\!\!

1603:   {\textstyle{(-)^{|{\cal M}|-1}\over|{\cal M}|}}\e^{-\eta_t\sum_{j\in\cal M}(s^j-s^{min})}

1604: \eqan

1605: In the last equality we expanded the product and performed the

1606: resulting exponential integrals. For finite $n$, the

1607: second to last one-dimensional integral should be numerically

1608: feasible. Once the product $\prod_{j=1}^n(1-\e^{-\eta_t(s^j-m)})$

1609: has been computed in time $O(n)$, the argument of the integral can

1610: be computed for each

1611: $i$ in time $O(1)$, hence the overall time to compute $\ell_t$ is

1612: $O(c\cdot n)$, where $c$ is the time to numerically compute one

1613: integral. For infinite $n$, the last sum may be approximated

1614: by the dominant contributions. Alternatively, one can modify

1615: the algorithm by considering only a finite pool of experts in

1616: each time step; see next paragraph. The expectation may also

1617: be approximated by (Monte Carlo) sampling $I_t$ several times.
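
For $n=2$ the closed-form sum can be evaluated by hand, which gives a quick sanity check (the labels $a$, $b$ and the gap $\Delta$ are notation introduced only for this example). With $s:=s_{<t}+k/\eta_t$ as above, let expert $a$ be the leader, $s^a=s^{min}$, and let $\Delta:=s^b-s^a\geq 0$. The subsets containing $a$ are $\{a\}$ and $\{a,b\}$, those containing $b$ are $\{b\}$ and $\{a,b\}$, so
\beqn
  P[I_t=a] \;=\; 1-{\textstyle{1\over 2}}\e^{-\eta_t\Delta},
  \qquad
  P[I_t=b] \;=\; \e^{-\eta_t\Delta}-{\textstyle{1\over 2}}\e^{-\eta_t\Delta}
           \;=\; {\textstyle{1\over 2}}\e^{-\eta_t\Delta}.
\eeqn
The two probabilities sum to one, and for $\Delta=0$ both equal ${1\over 2}$, as expected by symmetry. The expected instantaneous loss is then $\ell_t=s_t^a+{1\over 2}\e^{-\eta_t\Delta}(s_t^b-s_t^a)$.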

1618:

1619: Recall that approximating $\ell\ltt$ can be avoided by

1620: using $s\ltt\smin$ (Theorem~\ref{thFPL2}) or $u\ltt$ (bounds with

1621: high probability) instead.

1622:

1623: %-------------------------------%

1624: \paradot{Finitized expert pool}

1625: %-------------------------------%

1626: In the case of an infinite expert class, FPL has to compute a

1627: minimum over an infinite set in each time step, which is not

1628: directly feasible. One possibility to address this is to

1629: choose the experts from a \emph{finite pool} in each time

1630: step. This is the case in the algorithm of \citet{Gentile:03},

1631: and also discussed by \citet{Littlestone:94}. For FPL, we can

1632: obtain this behavior by introducing an \emph{entering time}

1633: $\tau^i\geq 1$ for each expert. Then expert $i$ is not

1634: considered for $t<\tau^i$. In the bounds, this leads to an

1635: additional $\frac{1}{\eta_T}$ in Theorem \ref{thIFPL} and

1636: Corollary \ref{corIFPL} and a further additional $\tau^i$ in

1637: the final bounds (Theorems \ref{thFPLStatic}--\ref{thFPL2}),

1638: since we must add the regret between the best expert in hindsight

1639: that has already entered the game and the best expert in

1640: hindsight overall. Selecting

1641: $\tau^i=k^i$ implies bounds for FPL with entering times similar to

1642: the ones we derived here. The details and proofs for this

1643: construction can be found in \citep{Poland:05actexp}.

1644:

1645: %-------------------------------%

1646: \paradot{Deterministic prediction and absolute loss}

1647: %-------------------------------%

1648: Another use of $w_t$ from the paragraph on computational aspects is the following: If

1649: the decision space is $\D=\Delta$, then FPL may make a

1650: deterministic decision $d=w_t\in\Delta$ at time $t$ with bounds

1651: now holding for sure, instead of selecting $e_i$ with probability

1652: $w_t^i$. For example for the absolute loss $s_t^i=|x_t-y_t^i|$

1653: with observation $x_t\in[0,1]$ and predictions $y_t^i\in[0,1]$, a

1654: master algorithm predicting deterministically $w_t\!\scp

1655: y_t\in[0,1]$ suffers absolute loss $|x_t-w_t\!\scp y_t|\leq\sum_i

1656: w_t^i|x_t-y_t^i|=\ell_t$, and hence has the same (or better)

1657: performance guarantees as FPL. In general, masters can be chosen

1658: deterministic if the prediction space $\Y$ is convex and the loss function

1659: Loss$(x,y)$ is convex in $y$.

1660: %

1661: For $x_t,y_t^i\in\{0,1\}$, the absolute loss $|x_t-p_t|$ of a master

1662: deterministically predicting $p_t\in[0,1]$ actually coincides with

1663: the $p_t$-expected 0/1 loss of a master predicting 1 with

1664: probability $p_t$. Hence a regret bound for the absolute loss also

1665: implies the same regret for the 0/1 loss.
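
Both claims are one-line computations. Since $\sum_i w_t^i=1$ and $w_t^i\geq 0$, the triangle inequality gives
\beqn
  |x_t-w_t\!\scp y_t| \;=\; \Big|\sum_i w_t^i(x_t-y_t^i)\Big|
  \;\leq\; \sum_i w_t^i|x_t-y_t^i| \;=\; \ell_t .
\eeqn
For binary $x_t$, a master predicting 1 with probability $p_t$ errs with probability $1-p_t=|1-p_t|$ if $x_t=1$ and with probability $p_t=|0-p_t|$ if $x_t=0$, i.e.\ with probability $|x_t-p_t|$ in both cases. As a small illustration with values chosen here, $w_t=({3\over 4},{1\over 4})$, $y_t=(1,0)$ and $x_t=1$ give $w_t\!\scp y_t={3\over 4}$, absolute loss ${1\over 4}$, and $\ell_t={3\over 4}\cdot 0+{1\over 4}\cdot 1={1\over 4}$.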

1666:

1667:

1668: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1669: \section{Discussion and Open Problems}\label{secConc}

1670: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1671:

1672: How does FPL compare with other expert advice algorithms? We

1673: briefly discuss four issues, summarized in Table \ref{tabregconst}.

1674:

1675: %-------------------------------%

1676: \paradot{Static bounds}

1677: %-------------------------------%

1678: Here the coefficient of the regret term $\sqrt{KL}$, referred to

1679: as the \emph{leading constant} in the sequel, is $2$ for FPL

1680: (Theorem~\ref{thFPLStatic}). It is thus a factor of $\sqrt 2$

1681: worse than the Hedge bound for arbitrary loss by

1682: \citet{Freund:97}, which is sharp in some sense \citep{Vovk:95}.

1683: This is the price one pays for FPL and its easy analysis for

1684: adaptive learning rate. There is evidence that this (worst-case)

1685: difference really exists and is not only a proof artifact.

1686: %-------------------------------%

1687: %\paradot{Special losses}

1688: %-------------------------------%

1689: For special loss functions, the bounds can sometimes be improved,

1690: e.g.\ to a leading constant of 1 in the static (randomized) WM

1691: case with 0/1 loss \citep{Cesa:97}\footnote{While FPL and Hedge and WMR

1692: \citep{Littlestone:94} can sample an expert without knowing its

1693: prediction, \citet{Cesa:97} need to know the experts' predictions.

1694: Note also that for many (smooth) loss-functions like the quadratic

1695: loss, finite regret can be achieved \citep{Vovk:90}.}.

1696: However, because of the structure of the FPL algorithm, it is

1697: questionable whether corresponding bounds hold for it.

1698:

1699: %-------------------------------%

1700: \paradot{Dynamic bounds}

1701: %-------------------------------%

1702: Not knowing the right learning rate in advance usually costs a

1703: factor of $\sqrt 2$. This is true for Hannan's algorithm

1704: \citep{Kalai:03} as well as in all our cases. Also for binary

1705: prediction with uniform complexities and 0/1 loss, this result

1706: has been established recently -- \citet{Yaroshinsky:04} show a

1707: dynamic regret bound with leading constant $\sqrt 2(1+\eps)$.

1708: Remarkably, the best dynamic bound for a WM variant proven by

1709: \citet{Auer:02pea} has a leading constant $2\sqrt 2$, which

1710: matches ours. Considering the difference in the static case,

1711: we therefore conjecture that a bound with leading constant of

1712: $2$ holds for a dynamic Hedge algorithm.

1713:

1714: %-------------------------------%

1715: \paradot{General weights}

1716: %-------------------------------%

1717: While there are several dynamic bounds for uniform weights, the

1718: only previous result for non-uniform weights we know of is

1719: \citep[Cor.16]{Gentile:03}, which gives the dynamic bound

1720: $\ell^{\mbox{\scriptsize Gentile}}\leqT\leq s^i\leqT+i+

1721: O\Big[\sqrt{(s^i\leqT+i)\ln(s^i\leqT+i)}\Big]$ for a $p$-norm

1722: algorithm for the absolute loss. This is comparable to our bound

1723: for rapidly decaying weights $w^i=\exp(-i)$, i.e.\ $k^i=i$. Our

1724: hierarchical FPL bound in Theorem \ref{thHFPL} $(b)$ generalizes

1725: this to arbitrary weights and losses and strengthens it, since

1726: both the asymptotic order and the leading constant are smaller.

1727:

1728: It seems that the analysis of all experts algorithms, including

1729: Weighted Majority variants and FPL, gets more complicated for

1730: general weights together with adaptive learning rate, because the

1731: choice of the learning rate must account for both the weight of

1732: the best expert (in hindsight) and its loss. Both quantities are

1733: not known in advance, but may have a different impact on the

1734: learning rate: While increasing the current loss estimate always

1735: decreases $\eta_t$, the optimal learning rate for an expert with

1736: higher complexity would be larger. On the other hand, all analyses

1737: known so far require decreasing $\eta_t$. Nevertheless we

1738: conjecture that the bounds $\propto\sqrt{Tk^i}$ and

1739: $\propto\sqrt{s_{1:T}^i k^i}$ also hold without the hierarchy

1740: trick, probably by using expert dependent learning rate

1741: $\eta_t^i$.

1742:

1743: \begin{table}[t]\centering\small

1744: \begin{tabular}{|c|c|c|c|c|}

1745:   \hline

1746:   $\eta$ & Loss & conjecture & Low.Bnd. & Upper Bound \\ \hline

1747:   static & 0/1 & 1            & 1?                         & 1 \citep{Cesa:97} \\

1748:   static & any & $\sqrt{2}$ ! & $\sqrt{2}$ \citep{Vovk:95} & $\sqrt{2}$ \cite[Hedge]{Freund:97}, 2 [FPL] \\

1749:  dynamic & 0/1 & $\sqrt{2}$   & 1? \citep{Hutter:03optisp} & $\sqrt{2}$ \cite{Yaroshinsky:04}, $2\sqrt{2}$ \cite[WM-Type?]{Auer:02pea} \\

1750:  dynamic & any & 2            & $\sqrt{2}$ \citep{Vovk:95} & $2\sqrt{2}$ [FPL], 2 \cite[Bayes]{Hutter:03optisp} \\

1751:   \hline

1752: \end{tabular}

1753:   \caption{\label{tabregconst}Comparison of the constants $c$ in regrets

1754:   $c\sqrt{\mbox{Loss}\times\ln n}$ for various settings and algorithms.}

1755: \end{table}

1756:

1757: %-------------------------------%

1758: \paradot{Comparison to Bayesian sequence prediction}

1759: %-------------------------------%

1760: We can also compare the \emph{worst-case} bounds for FPL obtained

1761: in this work to similar bounds for \emph{Bayesian sequence

1762: prediction}. Let $\{\nu_i\}$ be a class of probability

1763: distributions over sequences and assume that the true sequence is

1764: sampled from $\mu\in\{\nu_i\}$ with complexity $k^\mu$ ($\sum_i

1765: \e^{-k^{\nu_i}}\leq 1$). Then it is known that the Bayes optimal

1766: predictor based on the $\e^{-k^{\nu_i}}$-weighted mixture of

1767: $\nu_i$'s has an expected total loss of at most

1768: $L^\mu+2\sqrt{L^\mu k^\mu}+2k^\mu$, where $L^\mu$ is the expected

1769: total loss of the Bayes optimal predictor based on $\mu$

1770: \citep[Thm.2]{Hutter:02spupper},

1771: \citep[Thm.3.48]{Hutter:04uaibook}. Using FPL, we obtained

1772: the same bound except for the leading order constant, but for

1773: any sequence independently of the assumption that it is

1774: generated by $\mu$. This is another indication that a PEA

1775: bound with leading constant 2 could hold. See

1776: \citet{Hutter:04bayespea},

1777: \citet[Sec.6.3]{Hutter:03optisp} and

1778: \citet[Sec.3.7.4]{Hutter:04uaibook} for a more detailed

1779: comparison of Bayes bounds with PEA bounds.

1780:

1781: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1782: %         Bibliography        %

1783: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1784:

1785: \begin{small}

1786: \begin{thebibliography}{ACBFS95}

1787:

1788: \bibitem[ACBFS95]{Auer:95}

1789: P.~Auer, N.~Cesa-Bianchi, Y.~Freund, and R.~E. Schapire.

1790: \newblock Gambling in a rigged casino: The adversarial multi-armed bandit

1791:   problem.

1792: \newblock In {\em Proc. 36th Annual Symposium on Foundations of Computer

1793:   Science (FOCS 1995)}, pages 322--331, Los Alamitos, CA, 1995. IEEE Computer

1794:   Society Press.

1795:

1796: \bibitem[ACBG02]{Auer:02pea}

1797: P.~Auer, N.~Cesa-Bianchi, and C.~Gentile.

1798: \newblock Adaptive and self-confident on-line learning algorithms.

1799: \newblock {\em Journal of Computer and System Sciences}, 64(1):48--75, 2002.

1800:

1801: \bibitem[AG00]{Auer:00}

1802: P.~Auer and C.~Gentile.

1803: \newblock Adaptive and self-confident on-line learning algorithms.

1804: \newblock In {\em Proc. 13th Conf. on Computational Learning Theory}, pages

1805:   107--117. Morgan Kaufmann, San Francisco, CA, 2000.

1806:

1807: \bibitem[CB97]{Cesa:97}

1808: N.~Cesa-Bianchi{ et al.}

1809: \newblock How to use expert advice.

1810: \newblock {\em Journal of the ACM}, 44(3):427--485, 1997.

1811:

1812: \bibitem[FS97]{Freund:97}

1813: Y.~Freund and R.~E. Schapire.

1814: \newblock A decision-theoretic generalization of on-line learning and an

1815:   application to boosting.

1816: \newblock {\em Journal of Computer and System Sciences}, 55(1):119--139, 1997.

1817:

1818: \bibitem[Gen03]{Gentile:03}

1819: C.~Gentile.

1820: \newblock The robustness of the p-norm algorithm.

1821: \newblock {\em Machine Learning}, 53(3):265--299, 2003.

1822:

1823: \bibitem[Han57]{Hannan:57}

1824: J.~Hannan.

1825: \newblock Approximation to {Bayes} risk in repeated plays.

1826: \newblock In {\em Contributions to the Theory of Games 3}, pages 97--139.

1827:   Princeton University Press, 1957.

1828:

1829: \bibitem[HP04]{Hutter:04expert}

1830: M.~Hutter and J.~Poland.

1831: \newblock Prediction with expert advice by following the perturbed leader for

1832:   general weights.

1833: \newblock In {\em Proc. 15th International Conf. on Algorithmic Learning Theory

1834:   ({ALT-2004})}, volume 3244 of {\em LNAI}, pages 279--293, Padova, 2004.

1835:   Springer, Berlin.

1836:

1837: \bibitem[Hut03a]{Hutter:02spupper}

1838: M.~Hutter.

1839: \newblock Convergence and loss bounds for {Bayesian} sequence prediction.

1840: \newblock {\em IEEE Transactions on Information Theory}, 49(8):2061--2067,

1841:   2003.

1842:

1843: \bibitem[Hut03b]{Hutter:03optisp}

1844: M.~Hutter.

1845: \newblock Optimality of universal {B}ayesian prediction for general loss and

1846:   alphabet.

1847: \newblock {\em Journal of Machine Learning Research}, 4:971--1000, 2003.

1848:

1849: \bibitem[Hut04a]{Hutter:04bayespea}

1850: M.~Hutter.

1851: \newblock Online prediction -- {B}ayes versus experts.

1852: \newblock Technical report,

1853:   http://www.idsia.ch/$_{^\sim}$marcus/ai/bayespea.htm, July 2004.

1854: \newblock Presented at the EU PASCAL Workshop on Learning Theoretic and

1855:   Bayesian Inductive Principles (LTBIP-2004).

1856:

1857: \bibitem[Hut04b]{Hutter:04uaibook}

1858: M.~Hutter.

1859: \newblock {\em Universal Artificial Intelligence: Sequential Decisions based on

1860:   Algorithmic Probability}.

1861: \newblock Springer, Berlin, 2004.

1862: \newblock 300 pages, http://www.idsia.ch/$_{^{\sim}}$marcus/ai/uaibook.htm.

1863:

1864: \bibitem[KV03]{Kalai:03}

1865: A.~Kalai and S.~Vempala.

1866: \newblock Efficient algorithms for online decision.

1867: \newblock In {\em Proc. 16th Annual Conf. on Learning Theory ({COLT-2003})},

1868:   Lecture Notes in Artificial Intelligence, pages 506--521, Berlin, 2003.

1869:   Springer.

1870:

1871: \bibitem[LW89]{Littlestone:89}

1872: N.~Littlestone and M.~K. Warmuth.

1873: \newblock The weighted majority algorithm.

1874: \newblock In {\em 30th Annual Symposium on Foundations of Computer Science},

1875:   pages 256--261, Research Triangle Park, NC, 1989. IEEE.

1876:

1877: \bibitem[LW94]{Littlestone:94}

1878: N.~Littlestone and M.~K. Warmuth.

1879: \newblock The weighted majority algorithm.

1880: \newblock {\em Information and Computation}, 108(2):212--261, 1994.

1881:

1882: \bibitem[MB04]{McMahan:04}

1883: H.~B. McMahan and A.~Blum.

1884: \newblock Online geometric optimization in the bandit setting against an

1885:   adaptive adversary.

1886: \newblock In {\em 17th Annual Conference on Learning Theory (COLT)}, volume

1887:   3120 of {\em LNCS}, pages 109--123. Springer, 2004.

1888:

1889: \bibitem[McD89]{McDiarmid:89}

1890: C.~McDiarmid.

1891: \newblock On the method of bounded differences.

1892: \newblock {\em Surveys in Combinatorics}, 141, London Mathematical Society

1893:   Lecture Notes Series:148--188, 1989.

1894:

1895: \bibitem[PH05]{Poland:05actexp}

1896: J.~Poland and M.~Hutter.

1897: \newblock Master algorithms for active experts problems based on increasing

1898:   loss values.

1899: \newblock In {\em Annual Machine Learning Conference of Belgium and the

1900:   Netherlands ({Benelearn-2005})}, Enschede, 2005.

1901:

1902: \bibitem[Vov90]{Vovk:90}

1903: V.~G. Vovk.

1904: \newblock Aggregating strategies.

1905: \newblock In {\em Proc. 3rd Annual Workshop on Computational Learning Theory},

1906:   pages 371--383, Rochester, New York, 1990. ACM Press.

1907:

1908: \bibitem[Vov95]{Vovk:95}

1909: V.~G. Vovk.

1910: \newblock A game of prediction with expert advice.

1911: \newblock In {\em Proc. 8th Annual Conf. on Computational Learning Theory},

1912:   pages 51--60. ACM Press, New York, NY, 1995.

1913:

1914: \bibitem[YEYS04]{Yaroshinsky:04}

1915: R.~Yaroshinsky, R.~El-Yaniv, and S.~Seiden.

1916: \newblock How to better use expert advice.

1917: \newblock {\em Machine Learning}, 55(3):271--309, 2004.

1918:

1919: \end{thebibliography}

1920: \end{small}

1921:

1922: \end{document}

1923:

1924: %--------------------End-of-Expertx.tex-----------------------%

1925: