1: \documentclass{article}

2: \usepackage[letterpaper,margin=1in]{geometry}

3: \usepackage{amsfonts}

4: \usepackage{amsmath}

5: \usepackage{amssymb}

6: \usepackage{srcltx}

7: \usepackage{graphicx}

8:

9:

10: \def\jump{\vskip 0.05in}

11: \def\qed{\vrule height 7pt width 3pt depth 0pt}

12: \newenvironment{proof}{\noindent{\it Proof.} }{\qed\jump}

13: \newenvironment{sketch}{\noindent{\it Proof sketch.} }{\qed\jump}

14: \newtheorem{theorem}{Theorem}

15: \newtheorem{claim}{Claim}

16:

17: \def\N{{\mathbb{N}}}

18: \def\R{{\mathbb{R}}}

19: \def\wh{\widehat}

20:

21: \input{macros}

22:

23: \begin{document}

24: \title{A new Hedging algorithm and its application to inferring latent random variables}

25:

26: \author{Yoav Freund and Daniel Hsu \\

27: {\tt \{yfreund,djhsu\}@cs.ucsd.edu}}

28:

29: \maketitle

30:

31: \newcommand{\vp}{\vec{p}}

32: \newcommand{\dt}{\Delta t}

33:

34: \newcommand{\ctp}[2]{p_{#1}\paren{#2}}   % Continuous Time p

35: \newcommand{\ctg}[2]{g_{#1}\paren{#2}}  % Continuous Time g_i

36: \newcommand{\ctga}[1]{g_A\paren{#1}}     % Continuous Time g_A

37: \newcommand{\ctR}[2]{R_{#1}\paren{#2}}  % Continuous Time Regret

38: \newcommand{\ctr}[2]{r_{#1}\paren{#2}}

39: \newcommand{\ctd}[2]{d_{#1}\paren{#2}}

40:

41: \begin{abstract}

42:

43:   We present a new online learning algorithm for cumulative discounted

44:   gain. This learning algorithm does not use exponential weights on

45:   the experts. Instead, it uses a weighting scheme that depends on the

46:   regret of the master algorithm relative to the experts. In

47:   particular, experts whose discounted cumulative gain is smaller

48:   (worse) than that of the master algorithm receive zero weight.

49:   We also sketch how a regret-based algorithm can be used as an

50:   alternative to Bayesian averaging in the context of inferring latent

51:   random variables.

52:

53: \end{abstract}

54:

55: \section{Introduction} \label{sec:introduction}

56:

57: We study a variation on the online allocation problem presented by

58: Freund and Schapire in~\cite{FreundSc97}. Our problem varies from the

59: original in that we use {\em discounted} cumulative loss instead of

60: regular cumulative loss. Specifically, we consider the following

61: iterative game between a {\em hedger} and {\em Nature}.

62:

63: In this setting, there are $N$ actions (e.g.~strategies, experts) indexed

64: by $i$. The game between the hedger and Nature proceeds in iterations

65: $j=0,1,2,\ldots$. In the $j$th iteration:

66: \begin{enumerate}

67: \item The hedger chooses a distribution $\{p_i^j\}_{i=1}^N$ over the

68:   actions, where $p_i^j \geq 0$ and $\sum_{i=1}^N p_i^j = 1$.

69: \item Nature associates a gain $g_i^j \in [-1,1]$ with action $i$.

70: \item The gain of the hedger is $g^j_A = \sum_{i=1}^N p_i^j g_i^j$.

71: \end{enumerate}

72: We define the {\em discounted total gain} as follows. The initial

73: total gain is zero, $G_i^0 = 0$. The total gain for action $i$ at the start

74: of iteration $j+1$ is defined inductively as:

75: \[

76: G_i^{j+1} \doteq (1 - \alpha) G_i^j + g_i^j

77: \]

78: for some fixed {\em discount factor} $\alpha>0$.

79: The discounted total gain of the hedger is similarly

80: defined:

81: \[

82: G_A^0=0, \quad G_A^{j+1} \doteq (1 - \alpha) G_A^j + g_A^j~.

83: \]

84: We define the {\em regret} of the hedger with respect to action $i$ at the

85: start of iteration $j$ as

86: \[

87: R_i^j \doteq G_i^j - G_A^j

88: \]

89: It is easy to see that the regret obeys the following recursion:

90: \[

91: R_i^0=0, \quad R_i^{j+1} = (1 - \alpha) R_i^j + g_i^j - g_A^j~.

92: \]

93: Our goal is to find a hedging algorithm for which we can show a small uniform

94: upper bound on the regret, i.e.~a small positive real number

95: $B(\alpha)$ such that $R_i^j \leq B(\alpha)$ for all choices of Nature, all

96: $i$ and all $j$.
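To make the bookkeeping concrete, the following Python sketch replays a fixed sequence of gains and hedging distributions and tracks the discounted totals and the regrets. It is only an illustration of the recursions above; the function name \verb|discounted_regrets| and the list-of-lists input layout are assumptions made for this example.

```python
def discounted_regrets(gains, probs, alpha):
    """Replay the game and return (G, G_A, R) after the last iteration.

    gains[j][i] is the gain g_i^j in [-1, 1] and probs[j][i] is p_i^j.
    The input layout is an assumption made for this illustration.
    """
    n = len(gains[0])
    G = [0.0] * n      # G_i^0 = 0
    G_A = 0.0          # G_A^0 = 0
    R = [0.0] * n      # R_i^0 = 0
    for g, p in zip(gains, probs):
        # Gain of the hedger: g_A^j = sum_i p_i^j g_i^j.
        g_A = sum(pi * gi for pi, gi in zip(p, g))
        # Discounted updates: old totals shrink by (1 - alpha).
        G = [(1 - alpha) * Gi + gi for Gi, gi in zip(G, g)]
        G_A = (1 - alpha) * G_A + g_A
        R = [(1 - alpha) * Ri + gi - g_A for Ri, gi in zip(R, g)]
    return G, G_A, R
```

Running the regret recursion alongside the two gain recursions makes it easy to confirm numerically that $R_i^j = G_i^j - G_A^j$ at every step.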

97:

98: Our new hedging algorithm, which we call {\bf NormalHedge}, uses the

99: following weighting:

100: \begin{equation} \label{eqn:Hedge-distribution}

101: w_i^j \doteq

102: \begin{cases}

103: R_i^j \exp \paren{\frac{\alpha \brackets{R_i^j}^2}{8}} & \text{if $R_i^j >

104: 0$} \\

105:                    0               & \text{if $R_i^j \leq 0$.}

106: \end{cases}

107: \end{equation}

108: The hedging distribution is equal to the normalized weights

109: $p_i^j = w_i^j / \sum_{k=1}^N w_k^j$ unless all of the weights

110: are zero, in which case we use the uniform distribution $p_i^j = 1/N$.
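The weighting rule can be sketched in a few lines of Python. This is a minimal illustration of Equation~\eqref{eqn:Hedge-distribution} together with the uniform fallback; the function name \verb|normalhedge_distribution| is hypothetical, and no attempt is made to guard against floating-point overflow for very large regrets.

```python
import math

def normalhedge_distribution(regrets, alpha):
    """Map discounted regrets R_i^j to the hedging distribution p_i^j.

    Weights are R * exp(alpha * R^2 / 8) for positive regrets and 0 otherwise.
    """
    weights = [r * math.exp(alpha * r * r / 8.0) if r > 0 else 0.0
               for r in regrets]
    total = sum(weights)
    n = len(regrets)
    if total == 0.0:
        # No action has positive regret: fall back to the uniform distribution.
        return [1.0 / n] * n
    return [w / total for w in weights]
```

Note that an action with non-positive regret receives exactly zero probability as soon as at least one other action has positive regret.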

111:

112: Our main result is that if $\alpha$ is sufficiently small, the

113: following inequality holds uniformly over all game histories:

114: \begin{equation*}

115: {1 \over N} \sum_{i=1}^N \Phi\paren{\sqrt{\alpha} R_i^j} < 2.32

116: \end{equation*}

117: where

118: \[

119: \Phi\paren{x} =

120: \begin{cases}

121: \exp \paren{\frac{x^2}{8}}  & \text{if $x > 0$} \\

122:                    1               & \text{if $x \leq 0$.}

123:                    \end{cases}

124: \]

125: This implies, in particular, that for any $i$ and $j$,

126: \[

127: R_i^j \leq \sqrt{\frac{8 \ln (2.32N)}{\alpha}}.

128: \]
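The step from the average-potential inequality to the regret bound is that a single term of the average cannot exceed $2.32N$, so $\exp(\alpha [R_i^j]^2/8) < 2.32N$. The Python sketch below evaluates the resulting bound. The helper names are hypothetical, and the admissibility check uses $2.32$ in place of the unspecified constant $C$ from the main theorem, which makes it a conservative sufficient condition rather than the exact requirement.

```python
import math

def normalhedge_regret_bound(n, alpha, potential_bound=2.32):
    """Evaluate sqrt(8 * ln(potential_bound * n) / alpha)."""
    return math.sqrt(8.0 * math.log(potential_bound * n) / alpha)

def alpha_is_admissible(n, alpha, potential_bound=2.32):
    """Conservative check of alpha < 1 / (800 ln(C n)), using C <= 2.32.

    Assumption: substituting 2.32 for C only makes the check stricter.
    """
    return alpha < 1.0 / (800.0 * math.log(potential_bound * n))
```

For instance, with $N = 1000$ the check accepts $\alpha = 10^{-4}$ but rejects $\alpha = 10^{-3}$.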

129:

130: The discount factor $\alpha$ plays a similar role to the number of

131: iterations in the standard undiscounted cumulative loss framework.

132: %In

133: %order to compare this results to the results in~\cite{FreundSc97} we

134: %set $\alpha = 1/T$ and get that (for sufficiently large $T$) $R_i^j

135: %\leq \sqrt{8 T \ln 4N}$. This is very similar to the bound $R_i^T \leq

136: %\sqrt{2T \ln N}+\ln N$ on the loss of the exponential weights {\bf

137: %  Hedge} algorithm given in~\cite{FreundSc97} (Equation 11). The

138: %important difference is that the bounds for NormalHedge hold for any

139: %step $j$ while the bound for Hedge holds only for $j=T$. This is a

140: %significant improvement of NormalHedge over Hedge. In order to set the

141: %learning rate parameter $\beta$ in Hedge we need an a-priori upper

142: %bound on the total loss of the best expert. In NormalHedge the

143: %learning rate $\alpha$ is the discount factor we wish to use, and

144: %setting it does not require an a-priori upper bound on the total

145: %(discounted) loss of the best expert.

146: Indeed, it is easy to transform the usual exponential weights algorithms

147: from the standard framework (e.g.~Hedge \cite{FreundSc97}) to our

148: present setting (Section~\ref{sec:comparison}). Such algorithms also

149: enjoy discounted cumulative regret bounds of

150: \[ R_i^j \leq C \cdot \sqrt{\frac{\ln N}{\alpha}} \]

151: for some positive constant $C$, but they require knowledge of the number of

152: actions $N$ to tune a learning parameter. The tuning of NormalHedge does

153: not have this requirement\footnote{The guarantees afforded to NormalHedge

154: require $\alpha$ to be sufficiently smaller than $1/\ln N$, but this

155: restriction is operationally different from needing to know $N$ in

156: advance.}.

157:

158: The rest of this paper is organized as follows. In

159: Section~\ref{sec:drifting-game} we describe the main ideas behind the

160: construction and analysis of NormalHedge. In

161: Sections~\ref{sec:comparison} and \ref{sec:hedge} we discuss related work and compare NormalHedge to exponential

162: weights algorithms. Finally, in Section~\ref{sec:latent} we suggest

163: how to use NormalHedge to track latent variables and sketch how that

164: might be used for learning HMMs under the $L_1$ loss.

165:

166: \section{NormalHedge} \label{sec:drifting-game}

167:

168: \subsection{Preliminaries}

169:

170: NormalHedge and its analysis are based on the potential function

171: $\Phi(x)$ introduced in Section~\ref{sec:introduction}. Here

172: we give a slightly more elaborate definition for $\Phi(x)$ that includes a

173: constant $c$. The potential function is a %twice-differentiable

174: non-decreasing function of $x \in \R$

175: \begin{equation} \label{eqn:potential}

176: \Phi(x) \doteq \begin{cases}e^{x^2/2c} & \text{if $x > 0$} \\

177:                       1        & \text{if $x \leq 0$}

178:                       \end{cases}

179: \end{equation}

180: where $c > 1$. In our current version of NormalHedge, $c=4$. Decreasing

181: $c$ will improve the bound on the regret; we will also argue that $c$

182: cannot be decreased to $1$.

183:

184: The weights assigned by NormalHedge are set proportional to the first

185: derivative of $\Phi$, i.e.~$w_i^j \propto \Phi'(\sqrt{\alpha} R_i^j)$, where

186: $$ \Phi'(x) = \begin{cases}{x \over c}e^{x^2/2c} & \text{if $x > 0$} \\

187:                    0                   & \text{if $x \leq 0$.}

188:                    \end{cases} $$

189: In our analysis, we will also need to examine the second derivative of

190: $\Phi$:

191: $$ \Phi''(x) = \begin{cases}\paren{{1 \over c} + {x^2 \over c^2}}e^{x^2/2c}

192:                                        & \text{if $x > 0$} \\

193:                    0                   & \text{if $x < 0$.}

194:                    \end{cases} $$

195: Note that $\Phi''(x)$ has a discontinuity at $x=0$.
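The three formulas can be transcribed directly, which is convenient for checking them against finite differences. The following Python sketch does this for a general constant $c$; the function names are illustrative, and raising an error for the second derivative at $x = 0$ is a choice made here to reflect the discontinuity.

```python
import math

def phi(x, c=4.0):
    """Potential: exp(x^2 / (2c)) for x > 0, and 1 otherwise."""
    return math.exp(x * x / (2 * c)) if x > 0 else 1.0

def dphi(x, c=4.0):
    """First derivative: (x / c) exp(x^2 / (2c)) for x > 0, and 0 otherwise."""
    return (x / c) * math.exp(x * x / (2 * c)) if x > 0 else 0.0

def d2phi(x, c=4.0):
    """Second derivative; undefined at x = 0, where it jumps from 0 to 1/c."""
    if x > 0:
        return (1 / c + x * x / c ** 2) * math.exp(x * x / (2 * c))
    if x < 0:
        return 0.0
    raise ValueError("the second derivative is discontinuous at x = 0")
```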

196:

197: \subsection{An intuitive derivation}

198:

199: The intuition behind the potential function is based on considering

200: the following strategy for Nature. Suppose there are two types of

201: actions, {\em good} actions and {\em poor} actions. The gain for each

202: action on each iteration is chosen independently at random from a

203: distribution over $\{-1,+1\}$. The distribution for poor actions has

204: equal probabilities $1/2,1/2$ on the two outcomes, while the

205: distribution for the good experts is $(1+\gamma)/2$ on $+1$ and

206: $(1-\gamma)/2$ on $-1$ for some very small $\gamma>0$. Clearly, the

207: best hedging strategy is to put equal positive weights on the good

208: actions and zero weight on the poor actions. Unfortunately, the

209: hedging algorithm does not know at the beginning of the game which

210: experts are good, so it has to learn these weights online. Assuming

211: that the number of actions is infinite (or sufficiently large), the

212: per-iteration gain of the optimal weighting is $\gamma$, which implies

213: that the discounted cumulative gain of this strategy is $\gamma/\alpha$.

214:

215: Consider the regrets of this optimal hedging with respect to the good

216: actions. It is not hard to show that the expected value of the

217: discounted cumulative gain of a good action is $\gamma/\alpha$ and

218: that the variance is approximately $1/\alpha$ (becomes exact as

219: $\gamma \to 0$). Moreover, if $\alpha \to 0$ this distribution

220: approaches a {\em normal} distribution with mean $\gamma/\alpha$ and

221: variance $1/\alpha$. In other words, the distribution of the regrets

222: of optimal hedging with respect to the good actions is

223: $(1/Z)\exp(-\alpha R^2/2)$.

224:

225: Consider the expected value of the potential function

226: $\Phi(\sqrt{\alpha}R)$ for this distribution over the regrets. If we

227: set $c=1$ we find that the product of the probability of the regret

228: $R$ and the potential for the regret $R$ is a constant independent of

229: $R$:

230: $$ \frac1Z \cdot \exp\left(-\frac{\alpha R^2}{2} \right) \cdot \exp\left(\frac{\alpha

231: R^2}{2} \right) = \frac1Z = \Omega(1). $$

232: Thus the expected potential is infinite. However, if we set $c$ to be

233: larger than $1$ then the expected value of the potential function becomes

234: finite. Thus, roughly speaking, the potential associated with a regret

235: value is the reciprocal of the probability of that regret value being a

236: result of random fluctuations. This level of regret is unavoidable. The

237: design of NormalHedge is based on the goal of not allowing the average

238: regret to grow beyond this level that is generated by random fluctuations.

239: Ideally, we would be able to use a potential function with any constant $c$

240: larger than 1. However, what we are able to prove is that the algorithm

241: works for $c=4$.
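The role of the threshold $c = 1$ can be checked numerically under the Gaussian heuristic above. Assuming that the scaled regret $x = \sqrt{\alpha} R$ is a standard normal variable, a direct computation gives $E[\Phi(x)] = \frac12 + \frac12 \sqrt{c/(c-1)}$ for $c > 1$, which diverges as $c \to 1$. The Python sketch below compares this closed form, which is our own calculation and not a statement from the analysis, with a simple midpoint-rule integration; the function names and integration parameters are arbitrary choices.

```python
import math

def expected_potential_closed_form(c):
    """E[Phi(X)] for X ~ N(0, 1): 1/2 + (1/2) sqrt(c / (c - 1)) when c > 1."""
    if c <= 1.0:
        return math.inf  # the integrand no longer decays
    return 0.5 + 0.5 * math.sqrt(c / (c - 1.0))

def expected_potential_numeric(c, upper=12.0, steps=20000):
    """Midpoint-rule estimate of the same expectation (requires c > 1)."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        # Normal density times potential, for x > 0 only.
        total += math.exp(-x * x / 2.0) * math.exp(x * x / (2.0 * c))
    # The half-line x <= 0 has potential 1 and probability 1/2.
    return 0.5 + total * h / math.sqrt(2.0 * math.pi)
```

For $c = 4$ both functions give approximately $1.077$.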

242:

243: The idea of NormalHedge is to keep the average potential small. It is

244: therefore natural that the weight assigned to each action is proportional

245: to the derivative of the potential. Indeed, it is easily checked that the

246: weights $w_i^j$ defined in Equation~\eqref{eqn:Hedge-distribution} are

247: proportional to $\Phi'(\sqrt{\alpha}R_i^j)$.

248: This derivative, however, is best viewed when the hedging game is mapped

249: into continuous time.

250: %However, in order to make this

251: %use of a derivative sensible, we need to map the hedging game into

252: %continuous time.

253:

254: \subsection{The continuous time limit}

255: Our analysis of NormalHedge is based on mapping the integer time steps

256: $j=0,1,2,\ldots$ into real-valued time steps $t=0,\alpha,2\alpha,\ldots$

257: and then taking the limit $\alpha \to 0$. Formally, we redefine

258: the hedging game using a different notation which uses the real valued

259: time $t$ instead of the time index $j$. We assume a set of $N$

260: actions (experts), indexed by $i$. The game between the hedging

261: algorithm and Nature proceeds in iterations

262: $t=0,\alpha,2\alpha,\ldots$. At each iteration the

263: following sequence of steps takes place.

264:

265: \begin{enumerate}

266: \item The hedging algorithm defines a distribution $\braces{\ctp{i}{t}}_{i=1}^N$ over the

267:   actions, where $\ctp{i}{t} \geq 0$ and $\sum_{i=1}^N \ctp{i}{t} = 1$.

268: \item Nature associates a gain $\ctg{i}{t} \in [-\sqrt{\alpha},+\sqrt{\alpha}]$ with action $i$.

269: \item The gain of the hedger is $\ctga{t} = \sum_{i=1}^N \ctp{i}{t} \ctg{i}{t}$.

270: \end{enumerate}

271:

272: We skip the definitions of $G_i(t)$ and $G_A(t)$ as these can become

273: ill-behaved when $\alpha \to 0$. Instead we define the regret directly:

274: \[

275: \ctR{i}{0}=0,\;\; \ctR{i}{t+\alpha} = (1- \alpha)\ctR{i}{t} + \ctg{i}{t} - \ctga{t}~.

276: \]

277: Note that this definition of the regret is a scaled version of the

278: discrete time regret:

279: \[

280: \ctR{i}{j\alpha} = \sqrt{\alpha} R_i^j.

281: \]

282:

283: We now have the tools needed to prove our main result.

284: \begin{theorem} \label{thm:main}

285: There exists a positive constant $C < 2.32$ such that if $\alpha < 1/(800

286: \ln CN)$, then for any sequence of gains and any iteration $j$

287: $$ \frac1N \sum_{i=1}^N \Phi\paren{\sqrt{\alpha} R_i^j} < C. $$

288: \end{theorem}

289: %The proof of the theorem is given in the appendix.

290: %

291: \begin{sketch}

292: The full proof is given in the appendix, but here we sketch a

293: continuous-time argument (i.e.~we consider $\alpha \to 0$). The formal,

294: discrete-time proof shows that $\alpha < 1/(800 \ln

295: CN)$ suffices.

296:

297: We want to show that the average potential

298: \[ \Psi(t) \doteq \frac1N \sum_{i=1}^N \Phi(\ctR{i}{t}) \]

299: is bounded for all time $t$. Our approach is to show that its

300: time-derivative

301: \[ \frac{\partial}{\partial t} \Psi(t) = \lim_{\alpha \to 0} \frac1\alpha

302: \cdot \frac1N \sum_{i=1}^N \left\{ \Phi(\ctR{i}{t+\alpha}) -

303: \Phi(\ctR{i}{t}) \right\} \]

304: becomes non-positive as soon as $\Psi(t)$ is above some constant (recall

305: that the time steps are in increments of $\alpha$). Since $\Phi(x)$ is

306: constant for $x < 0$, we need only consider $i$ such that

307: $\ctR{i}{t+\alpha} \geq 0$. Ignoring the discontinuity of $\Phi''(x)$ at

308: $x=0$, Taylor's theorem implies that for some $\rho_i \leq

309: \max\{\ctR{i}{t}, \ctR{i}{t+\alpha}\}$,

310: \begin{eqnarray*}

311:  \sum_{i: \ctR{i}{t} \geq 0}

312: \Phi(\ctR{i}{t+\alpha}) - \Phi(\ctR{i}{t})

313: & = & \sum_{i: \ctR{i}{t} \geq 0} \Phi((1-\alpha) \ctR{i}{t} + g_i(t) - g_A(t)) - \Phi(\ctR{i}{t}) \\

314: & = & \sum_{i: \ctR{i}{t} \geq 0} (-\alpha \ctR{i}{t} + g_i(t) - g_A(t))

315: \Phi'(\ctR{i}{t}) \\

316: & & \quad \quad \quad \mbox{} + \frac12 (g_i(t) - g_A(t) - \alpha \ctR{i}{t})^2 \Phi''(\rho_i) \\

317: & \leq & \sum_{i: \ctR{i}{t} \geq 0} -\alpha \ctR{i}{t} \Phi'(\ctR{i}{t}) + \frac12 (g_i(t) - g_A(t) -

318: \alpha \ctR{i}{t})^2 \Phi''(\rho_i) \\

319: & \leq & \sum_{i: \ctR{i}{t} \geq 0} -\alpha \ctR{i}{t} \Phi'(\ctR{i}{t}) +

320: \frac12 (2\sqrt{\alpha} + \alpha

321: \ctR{i}{t})^2 \Phi''(\ctR{i}{t} + 2\sqrt{\alpha}).

322: \end{eqnarray*}

323: The first inequality uses the fact that the weights are proportional to the

324: derivatives of the potentials

325: $$ \sum_{i: \ctR{i}{t} \geq 0} g_i(t) \cdot

326: \frac{\Phi'(\ctR{i}{t})}{\sum_{j: \ctR{j}{t} \geq 0} \Phi'(\ctR{j}{t})} =

327: g_A(t), $$

328: and the second inequality follows because $|g_i(t) - g_A(t)| \leq

329: 2\sqrt{\alpha}$. Now dividing by $\alpha$ and $N$ and taking the limit

330: $\alpha \to 0$, we have

331: \begin{eqnarray*}

332: \frac{\partial}{\partial t} \Psi(t)

333: & \leq & \lim_{\alpha \to 0} \frac1\alpha \cdot \frac1N \sum_{i: \ctR{i}{t}

334: \geq 0} \frac12 (2\sqrt{\alpha} + \alpha \ctR{i}{t})^2 \Phi''(\ctR{i}{t} +

335: 2\sqrt{\alpha}) -\alpha \ctR{i}{t} \Phi'(\ctR{i}{t}) \\

336: & = & \lim_{\alpha \to 0} \frac1N \sum_{i: \ctR{i}{t}

337: \geq 0} \frac12 (2 + \sqrt{\alpha}\ctR{i}{t})^2 \Phi''(\ctR{i}{t} +

338: 2\sqrt{\alpha}) - \ctR{i}{t} \Phi'(\ctR{i}{t}) \\

339: & = & \frac1N \sum_{i: \ctR{i}{t} \geq 0} \left\{ 2 \left( \frac1c + \frac{\ctR{i}{t}^2}{c^2} \right)

340: \exp(\ctR{i}{t}^2/2c) - \frac{\ctR{i}{t}^2}{c} \exp(\ctR{i}{t}^2/2c) \right\}

341: \\

342: & \leq & \frac2c \Psi(t) + \frac1{cN} \sum_{i=1}^N \left( \frac2c - 1 \right)

343: \ctR{i}{t}^2 \exp(\ctR{i}{t}^2/2c).

344: \end{eqnarray*}

345: If $\Psi(t) \geq B$, then this final RHS is maximized when $R_i(t) \equiv

346: \sqrt{2c\ln B}$ for all $i$, whereupon

347: $$ \frac{\partial}{\partial t} \Psi(t) \leq \frac{2B}{c} \left( 1 + \left(

348: \frac2c - 1\right) c \ln B \right). $$

349: This is non-positive for sufficiently large $B$ and $c \geq 2 + 1/\ln B$.

350: \end{sketch}
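The last step of the sketch is easy to explore numerically. The Python snippet below evaluates the final upper bound on the time-derivative of the average potential and solves $1 + (2 - c) \ln B = 0$ for the level $B$ at which it changes sign. It only illustrates the continuous-time heuristic and does not reproduce the constants of the formal discrete-time proof; the function names are hypothetical.

```python
import math

def drift_upper_bound(B, c):
    """Evaluate (2B / c) * (1 + (2/c - 1) * c * ln B)."""
    return (2.0 * B / c) * (1.0 + (2.0 / c - 1.0) * c * math.log(B))

def sign_change_level(c):
    """Smallest B with a non-positive bound: B = exp(1 / (c - 2)), for c > 2."""
    if c <= 2.0:
        raise ValueError("the bound is positive for every B > 1 when c <= 2")
    return math.exp(1.0 / (c - 2.0))
```

For $c = 4$ the sign change occurs at $B = e^{1/2} \approx 1.65$, so the bound is already negative at $B = 2.32$.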

351:

352: \section{Related work} \label{sec:comparison}

353:

354: \subsection{Relation to other online learning algorithms}

355:

356: The Hedge algorithm~\cite{FreundSc97}, as well as most of the work on

357: online learning algorithms, is based on exponential weighting, where

358: the weight assigned to an expert is exponential in the cumulative loss

359: of that expert. NormalHedge uses a very different weighting

360: scheme. The most important difference is that the weight of an expert

361: depends on the regret of the master algorithm relative to that expert,

362: rather than just on the loss of that expert. In particular, experts

363: whose discounted cumulative loss is larger than that of the master

364: algorithm receive zero weight. We expand on the comparison of NormalHedge

365: to Hedge in Section~\ref{sec:hedge}.

366:

367: The starting point for the derivation and analysis of NormalHedge is

368: the Binomial Weights algorithm of Cesa-Bianchi et

369: al.~\cite{CesabianchiFrHeWa96}. The Binomial Weights algorithm is an

370: algorithm for a restricted version of the experts prediction

371: problem~\cite{LittlestoneWa94,CesabianchiFrHeHaScWa97}. In this version, the

372: sequence to be predicted is binary and all of the predictions are also

373: binary. The Binomial Weights algorithm is analyzed using a type of

374: {\em chip game}. In this game each expert is represented as a chip, and at

375: each iteration each chip has a location on the integer line. The

376: position of the chip corresponds to the number of mistakes that were

377: made by the expert. The a-priori assumption is that there is at least

378: one expert that makes at most $k$ mistakes, and the goal is to

379: define a rule for combining the experts' predictions in a way that

380: would minimize the maximal number of mistakes of the master algorithm.

381:

382: The chip game analysis leads naturally to the definition of the {\em

383:   potential function} and the evolution of this potential function

384: from iteration to iteration yields the Binomial Weights algorithm.  A

385: closely related notion of potential was used in the Boost-by-Majority

386: algorithm. The chip-game analysis was extended by Schapire's work on

387: drifting games~\cite{Schapire01} and by Freund and Opper's work on drifting

388: games in continuous time~\cite{FreundOp02}. NormalHedge naturally extends

389: the continuous time drifting games to a setting in which one seeks to

390: minimize discounted loss.

391:

392: \subsection{Relation to switching and sleeping experts}

393:

394: The use of discounted cumulative loss represents an alternative to the

395: ``switching experts'' framework of Herbster and

396: Warmuth~\cite{HerbsterWa98}. If the best expert changes at a rate of

397: $O(\alpha)$, then NormalHedge

398: will switch to the new best expert because the losses that occurred more

399: than $1/\alpha$ iterations ago make a small contribution to the discounted

400: total loss.

401:

402: A useful extension of NormalHedge is to use experts that can

403: abstain, similar to the setup studied in~\cite{FreundScSiWa97}. To do this we assume

404: that each expert $i$, at each iteration $j$, outputs a confidence

405: level $0 \leq c_i^j \leq 1$. Instead of using the vector

406: $\{p_i^j\}_{i=1}^N$ the hedger uses the vector $\{p_i^j

407: c_i^j/Z^j\}_{i=1}^N$ where $Z^j = \sum_{i=1}^N p_i^j c_i^j$. The gain

408: $g_i^j$ of action $i$ at iteration $j$ is replaced by $c_i^jg_i^j$,

409: and the discounted cumulative gain and the discounted cumulative

410: regret change in the corresponding way. The bounds on the average

411: potential transfer without change. This allows an expert to abstain

412: from making a prediction. By setting $c_i^j=0$ the expert effectively

413: removes itself from the pool of experts used by the hedger. It also

414: avoids suffering any loss. However, an expert cannot always abstain,

415: because then its discounted cumulative gain will be driven to zero by

416: the discount factor.

417: We will use this extension in Section~\ref{sec:latent}.
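The confidence-weighted variant amounts to two small transformations, sketched below in Python. The function names are hypothetical. The text does not say what the hedger should do when every expert with positive probability abstains, so raising an error in that case is an assumption of this sketch.

```python
def abstaining_distribution(probs, confidences):
    """Reweight p_i^j by the confidences c_i^j and renormalize by Z^j."""
    weighted = [p * c for p, c in zip(probs, confidences)]
    z = sum(weighted)
    if z <= 0.0:
        # Assumption: this case is not covered by the text.
        raise ValueError("all experts with positive probability abstained")
    return [w / z for w in weighted]

def confidence_scaled_gains(gains, confidences):
    """Replace each gain g_i^j by c_i^j * g_i^j."""
    return [c * g for c, g in zip(confidences, gains)]
```

An expert with $c_i^j = 0$ receives zero probability and contributes zero gain in that iteration, which matches the description of abstaining above.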

418:

419: \section{Comparison of NormalHedge and Hedge} \label{sec:hedge}

420:

421: \subsection{Discounted regret bound for Hedge}

422:

423: To ease the comparison, we first recast the Hedge algorithm

424: \cite{FreundSc97} into our current framework with discounted gains. The

425: weights used by Hedge are

426: $$ w_i^j \doteq \exp(\eta G_i^j) $$

427: where $G_i^j$ is the discounted cumulative gain of action $i$ at the

428: start of iteration $j$, and $\eta > 0$ is the learning rate parameter.

429: When written recursively as

430: $$ w_i^{j+1} = \exp\left(\eta ((1-\alpha) G_i^j + g_i^j) \right)

431: \propto \left( w_i^j \right)^{1-\alpha} \exp(\eta g_i^j), $$

432: we see that the effect of discounting is a dampening of the previous

433: weights $w_i^j$ prior to the usual multiplicative update rule.
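The recursive form can be checked against the direct definition with a short Python sketch. The function names are illustrative. With unnormalized weights the two forms agree exactly; if the weights are normalized before the update, they agree up to a common factor.

```python
import math

def hedge_weights(G, eta):
    """Direct definition: w_i = exp(eta * G_i)."""
    return [math.exp(eta * g) for g in G]

def hedge_step(weights, G, gains, alpha, eta):
    """One discounted update of both the weights and the cumulative gains."""
    # Discounted gain recursion: G <- (1 - alpha) G + g.
    new_G = [(1 - alpha) * Gi + gi for Gi, gi in zip(G, gains)]
    # Dampen the old weights, then apply the usual multiplicative update.
    new_w = [w ** (1 - alpha) * math.exp(eta * gi)
             for w, gi in zip(weights, gains)]
    return new_w, new_G
```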

434:

435: Fix any iteration $j$ and define the adjusted cumulative gain of action $i$

436: at the start of iteration $k$ to be

437: \[ \wh G_i^k = \sum_{s=1}^{k-1} (1-\alpha)^{j-1-s} g_i^s \]

438: with $\wh G_i^0 = 0$. The gain of Hedge in iteration $k$ is

439: \[ g_A^k =

440: \frac{\sum_{i=1}^N w_i^k g_i^k}{\sum_{i=1}^N w_i^k}

441: = \frac{\sum_{i=1}^N e^{\eta \wh G_i^k}

442: g_i^k}{\sum_{i=1}^N e^{\eta \wh G_i^k}}

443: \]

444: and the adjusted cumulative gain of Hedge at the start of iteration $k$ is

445: \[ \wh G_A^k = \sum_{s=1}^{k-1} (1-\alpha)^{j-1-s} g_A^s. \]

446: Then the discounted cumulative regret to action $i$ at the start of

447: iteration $j$ is $\wh G_i^j - \wh G_A^j$.

448:

449: We analyze the (log of the) ratios $W_k / W_{k-1}$, where

450: \[ W_k = \sum_{i=1}^N e^{\eta \wh G_i^k} \]

451: and $W_0 = N$. We lower bound $\ln(W_j / W_0)$ as

452: \[ \ln \frac{W_j}{W_0} = \ln \sum_{i=1}^N e^{\eta \wh G_i^j} - \ln N \geq

453: \ln e^{\eta \wh G_i^j} - \ln N = \eta \wh G_i^j - \ln N \]

454: (for any $i$), and we upper bound it as

455: \begin{align*}

456: \ln \frac{W_j}{W_0}

457: & = \sum_{k=1}^{j-1} \ln \frac{W_k}{W_{k-1}} \\

458: & = \sum_{k=1}^{j-1} \ln \frac{\sum_{i=1}^N e^{\eta \wh G_i^{k-1}} e^{\eta

459: (1-\alpha)^{j-1-k} g_i^k}}{\sum_{i=1}^N e^{\eta \wh G_i^{k-1}}} \\

460: & \leq \sum_{k=1}^{j-1} \eta \cdot \frac{\sum_{i=1}^N e^{\eta \wh G_i^{k-1}}

461: (1-\alpha)^{j-1-k} g_i^k}{\sum_{i=1}^N e^{\eta \wh G_i^{k-1}}} +

462: \frac{\eta^2}{8} \cdot 4(1-\alpha)^{2(j-1-k)} \quad \text{(Hoeffding's

463: inequality)} \\

464: & = \sum_{k=1}^{j-1} \eta (1-\alpha)^{j-1-k} g_A^k + \frac{\eta^2}{2}

465: (1-\alpha)^{2(j-1-k)} \\

466: & \leq \eta \wh G_A^j + \frac{\eta^2}{2} \cdot \frac{1}{1-(1-\alpha)^2} \\

467: & = \eta \wh G_A^j + \frac{\eta^2}{4(\alpha - \alpha^2/2)}.

468: \end{align*}

469: Therefore, the discounted cumulative regret of Hedge to action $i$ at the

470: start of any iteration $j$ is

471: $$ R_i^j = \wh G_i^j - \wh G_A^j \leq \frac{\ln N}{\eta}

472: + \frac{\eta}{4(\alpha - \alpha^2/2)}. $$

473: Choosing $\eta = \sqrt{4(\alpha - \alpha^2/2) \ln N}$ gives

474: $$ R_i^j \leq \sqrt{\frac{\ln N}{\alpha - \alpha^2/2}}. $$
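The choice of $\eta$ balances the two terms of the bound. The following Python sketch evaluates the bound for an arbitrary $\eta$ and for the tuned value, so that the effect of a mistuned learning rate can be inspected directly. The function names are hypothetical.

```python
import math

def hedge_regret_bound(n, alpha, eta):
    """Evaluate ln(N) / eta + eta / (4 (alpha - alpha^2 / 2))."""
    return math.log(n) / eta + eta / (4.0 * (alpha - alpha ** 2 / 2.0))

def tuned_eta(n, alpha):
    """Learning rate that equalizes the two terms of the bound."""
    return math.sqrt(4.0 * (alpha - alpha ** 2 / 2.0) * math.log(n))

def tuned_bound(n, alpha):
    """Value of the bound at the tuned learning rate."""
    return math.sqrt(math.log(n) / (alpha - alpha ** 2 / 2.0))
```

For $N = 1000$ and $\alpha = 0.001$ the tuned bound is approximately $83.1$, and doubling or halving $\eta$ increases it.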

475:

476: This regret bound has the same form as the one implied by

477: Theorem~\ref{thm:main}, and indeed has better leading constants. However,

478: this bound only holds when $\eta$ is tuned with knowledge of the number of

479: actions $N$. If instead one sets $\eta = \Theta(\sqrt{\alpha})$

480: independently of $N$, the bound for Hedge is worse by a factor of

481: $\Theta(\sqrt{\ln N})$. Furthermore, this setting of $\eta$ is for

482: optimizing a bound that anticipates the worst-case sequence of gains; when

483: Nature is not optimally adversarial, then a proper setting of $\eta$ may

484: require other prior knowledge.

485:

486: \subsection{Simulations}

487:

488: \subsubsection{The effect of good experts}

489:

490: To empirically compare Hedge and NormalHedge, we first simulated the two

491: algorithms in a scenario similar to that described in

492: Section~\ref{sec:drifting-game}:

493: \begin{itemize}

494: \item The number of experts is $N = 1000$, and the discount parameter is

495: $\alpha = 0.001$.

496: \item At any given time, there is a set of $N_G = f \cdot N$ good experts

497: and $N - N_G$ bad experts. (We varied $f \in \{ 0.001, 0.01, 0.1, 0.5 \}$.)

498:   \begin{itemize}

499:   \item With probability $0.5 + \gamma/2$, \emph{every} good expert

500:   receives gain $+1$; with probability $0.5 - \gamma/2$, \emph{every} good

501:   expert receives gain $-1$. (We varied $\gamma \in \{ 0.2, 0.4, 0.6, 0.8

502:   \}$.)

503:   \item Bad experts receive gain $+1$ and $-1$ with equal probability.

504:   \end{itemize}

505: \item Initially, the set of good experts is $\{ 0, 1, \ldots, N_G-1 \}$.

506: \item After every $1/\alpha$ iterations, the set of good experts shifts

507: from $\{ i_0, i_0 + 1, \ldots, i_0 + N_G-1 \}$ to $\{ i_0 + N_G, i_0 +

508: N_G+1, \ldots, i_0 + 2N_G-1 \}$ (with addition modulo $N$).

509: \end{itemize}

510: Thus, the set of good experts completely changes every $1/\alpha$

511: iterations. In each iteration, all good experts receive the same gain,

512: which is $\gamma$ in expectation. In contrast, the gain of each bad expert

513: is decided independently with a fair coin.
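One way to generate gains for this scenario is sketched below in Python. This is not the code used to produce the reported results: the function names, the explicit \verb|period| argument (standing for $1/\alpha$), and the convention that the first shift happens at iteration \verb|period| are assumptions made for illustration.

```python
import random

def good_expert_set(j, n, n_good, period):
    """Indices of the good experts at iteration j (experts are 0-based)."""
    # The block of good experts advances by n_good every `period` iterations,
    # with addition modulo n.
    start = ((j // period) * n_good) % n
    return {(start + k) % n for k in range(n_good)}

def sample_gains(j, n, n_good, gamma, period, rng):
    """Draw one vector of gains in {-1, +1} for iteration j."""
    good = good_expert_set(j, n, n_good, period)
    # A single biased coin decides the gain of every good expert.
    shared = 1 if rng.random() < 0.5 + gamma / 2 else -1
    # Each bad expert flips its own fair coin.
    return [shared if i in good else rng.choice((-1, 1)) for i in range(n)]
```

Passing a seeded generator such as \verb|random.Random(0)| makes individual runs reproducible.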

514:

515: We tuned the learning rate parameter for Hedge to $\eta = \sqrt{(\alpha -

516: \alpha^2/2) \ln N}$. For NormalHedge, we varied $c \in \{ 1, 2, 4\}$.

517: Recall that the regret bound we can show for NormalHedge holds for $c = 2$

518: as $\alpha \to 0$ (the formal proof is stated with $c = 4$).

519:

520: Figures~\ref{fig:sim1-1} and \ref{fig:sim1-2} depict the discounted

521: cumulative regret to the best expert (averaged over $50$ runs). First, we

522: observe that NormalHedge fares better than Hedge when the advantage of the

523: good experts is large and the fraction of experts that are good is large.

524: In such cases, the advantage of NormalHedge is especially pronounced within

525: $1/\alpha$ iterations (before the set of good experts shifts). Second, we

526: observe that the performance of NormalHedge generally improves as the value

527: of $c$ is decreased. Indeed, the setting of $c = 1$ (for which we have no

528: theoretical guarantees) yields the best results for NormalHedge (and in

529: fact outperforms Hedge in every simulation). It would be very interesting

530: to establish guarantees for NormalHedge for $c \to 1$.

531:

532: \begin{figure}

533: \begin{center}

534: \begin{tabular}{cc}

535: \includegraphics[width=0.43\textwidth]{plots/expt-0.20000-0.00100-reg.eps} &

536: \includegraphics[width=0.43\textwidth]{plots/expt-0.40000-0.00100-reg.eps} \\

537: $\gamma = 0.2, f = 0.001$ & $\gamma = 0.4, f = 0.001$ \\

538: \includegraphics[width=0.43\textwidth]{plots/expt-0.60000-0.00100-reg.eps} &

539: \includegraphics[width=0.43\textwidth]{plots/expt-0.80000-0.00100-reg.eps} \\

540: $\gamma = 0.6, f = 0.001$ & $\gamma = 0.8, f = 0.001$ \\

541: \includegraphics[width=0.43\textwidth]{plots/expt-0.20000-0.01000-reg.eps} &

542: \includegraphics[width=0.43\textwidth]{plots/expt-0.40000-0.01000-reg.eps} \\

543: $\gamma = 0.2, f = 0.01$ & $\gamma = 0.4, f = 0.01$ \\

544: \includegraphics[width=0.43\textwidth]{plots/expt-0.60000-0.01000-reg.eps} &

545: \includegraphics[width=0.43\textwidth]{plots/expt-0.80000-0.01000-reg.eps} \\

546: $\gamma = 0.6, f = 0.01$ & $\gamma = 0.8, f = 0.01$

547: \end{tabular}

548: \end{center}

549: \caption{Regrets to the best expert in the first simulation;

550: $\gamma \in \{ 0.2, 0.4, 0.6, 0.8 \}$ and $f \in \{ 0.001, 0.01 \}$.}

551: \label{fig:sim1-1}

552: \end{figure}

553:

554: \begin{figure}

555: \begin{center}

556: \begin{tabular}{cc}

557: \includegraphics[width=0.43\textwidth]{plots/expt-0.20000-0.10000-reg.eps} &

558: \includegraphics[width=0.43\textwidth]{plots/expt-0.40000-0.10000-reg.eps} \\

559: $\gamma = 0.2, f = 0.1$ & $\gamma = 0.4, f = 0.1$ \\

560: \includegraphics[width=0.43\textwidth]{plots/expt-0.60000-0.10000-reg.eps} &

561: \includegraphics[width=0.43\textwidth]{plots/expt-0.80000-0.10000-reg.eps} \\

562: $\gamma = 0.6, f = 0.1$ & $\gamma = 0.8, f = 0.1$ \\

563: \includegraphics[width=0.43\textwidth]{plots/expt-0.20000-0.50000-reg.eps} &

564: \includegraphics[width=0.43\textwidth]{plots/expt-0.40000-0.50000-reg.eps} \\

565: $\gamma = 0.2, f = 0.5$ & $\gamma = 0.4, f = 0.5$ \\

566: \includegraphics[width=0.43\textwidth]{plots/expt-0.60000-0.50000-reg.eps} &

567: \includegraphics[width=0.43\textwidth]{plots/expt-0.80000-0.50000-reg.eps} \\

568: $\gamma = 0.6, f = 0.5$ & $\gamma = 0.8, f = 0.5$

569: \end{tabular}

570: \end{center}

571: \caption{Regrets to the best expert in the first simulation;

572: $\gamma \in \{ 0.2, 0.4, 0.6, 0.8 \}$ and $f \in \{ 0.1, 0.5 \}$.}

573: \label{fig:sim1-2}

574: \end{figure}

575:

576: \subsubsection{The effect of tuning $\eta$ in Hedge}

577:

578: Next, to bring out the issue with parameter tuning in Hedge, we conducted a

579: simulation in which we fix the fraction of experts that are good, but vary

580: the total number of experts:

581: \begin{itemize}

582: \item The number of experts is $N$, and the discount parameter is

583: $\alpha = 0.001$. (We varied $N \in \{ 10, 100, 1000 \}$.)

584: \item The fraction of experts that are good is fixed at $f = 0.1$. The

585: notion of good and bad experts is the same as in the first simulation. (We

586: varied $\gamma \in \{ 0.2, 0.8 \}$.)

587: \item The remaining details are the same as in the first simulation.

588: \end{itemize}

589: Again, we tuned the learning rate parameter for Hedge to $\eta =

590: \sqrt{(\alpha - \alpha^2/2) \ln N}$, which now changes as we vary the

591: total number of experts, and we varied $c \in \{ 1, 2, 4 \}$ in

592: NormalHedge.

593:

594: The results (Figure~\ref{fig:sim2-1}) indicate that as $N$ decreases

595: (e.g.~$N = 100, 10$), the disparity between Hedge and NormalHedge

596: increases. We believe this is an issue with tuning the learning rate

597: $\eta$, a parameter that is conspicuously absent in NormalHedge, but we have not

598: precisely characterized the issue.

599:

600: \begin{figure}

601: \begin{center}

602: \begin{tabular}{cc}

603: \includegraphics[width=0.43\textwidth]{plots/expt2-0.20000-1000-reg.eps} &

604: \includegraphics[width=0.43\textwidth]{plots/expt2-0.80000-1000-reg.eps} \\

605: $\gamma = 0.2, N = 1000$ & $\gamma = 0.8, N = 1000$ \\

606: \includegraphics[width=0.43\textwidth]{plots/expt2-0.20000-100-reg.eps} &

607: \includegraphics[width=0.43\textwidth]{plots/expt2-0.80000-100-reg.eps} \\

608: $\gamma = 0.2, N = 100$ & $\gamma = 0.8, N = 100$ \\

609: \includegraphics[width=0.43\textwidth]{plots/expt2-0.20000-10-reg.eps} &

610: \includegraphics[width=0.43\textwidth]{plots/expt2-0.80000-10-reg.eps} \\

611: $\gamma = 0.2, N = 10$ & $\gamma = 0.8, N = 10$

612: \end{tabular}

613: \end{center}

614: \caption{Regrets to the best expert in the second simulation;

615: $\gamma \in \{ 0.2, 0.8 \}$ and $N \in \{ 1000, 100, 10 \}$.}

616: \label{fig:sim2-1}

617: \end{figure}

618:

619: %

620: %

621: %

622: %\begin{figure}

623: %\begin{center}

624: %\includegraphics[width=0.5\textwidth]{shifting-reg-1000.eps} \\

625: %\vskip 0.2in

626: %\includegraphics[width=0.5\textwidth]{shifting-reg-100.eps} \\

627: %\vskip 0.2in

628: %\includegraphics[width=0.5\textwidth]{shifting-reg-10.eps}

629: %\vskip 0.1in

630: %\end{center}

631: %\caption{Regrets in the second simulation.} \label{fig:sim2}

632: %\end{figure}

633: %

634: %

635: \section{Inferring latent random variables} \label{sec:latent}

636:

637: An important problem in statistical inference is to make predictions

638: or choose actions when the system under consideration has internal

639: states that cannot be observed directly. There are many manifestations

640: of this problem, including graphical models, Hidden Markov Models

641: (HMMs), Partially Observable Markov Decision Processes (POMDPs) and

642: Kalman filters. The common method for dealing with hidden states is to

643: model them as {\em latent random variables}. The relation between the

644: latent random variables and the observable random variables is modeled

645: using a joint probability distribution. Two very important

646: sub-problems that arise in this approach are learning joint

647: distributions that involve latent random variables from examples that

648: contain only the state of the observable random variables and using

649: joint distributions of this type to infer the value of some

650: variables given the state of others. At this time there is no good

651: universal solution to either of these sub-problems.

652:

653: We propose a different approach to the problem, where instead of

654: associating hidden states with hidden random variables, we associate

655: states with different experts. What we present here describes some

656: initial ideas. It is not an attempt to propose a solution to this

657: large and complex problem.

658:

659: Suppose that we are to predict a binary sequence $x_1,x_2,\ldots$,

660: $x_t \in \{0,1\}$ and suppose that we believe that the sequence can be

661: predicted reasonably well using a Hidden Markov Model. Specifically,

662: suppose there is a hidden state $S$ which attains one of the values

663: $1,\ldots,k$ at each time step. Suppose that the state transition is

664: Markovian and stationary, i.e.

665: \[

666: P(S_t |S_{t-1},S_{t-2},\ldots) = P(S_t | S_{t-1}) = P(S_{t-1}|S_{t-2})

667: = \cdots

668: \]

669: Assume in addition that the hidden state does not change very often,

670: i.e. $P(S_{t+1} = S_t)$ is close to 1. Finally, assume that the

671: distribution of the observable variable $X_t$ depends only on the

672: hidden state at the same time $S_t$.

673:

674: Consider the problem of predicting $X_{t+1}$ given $x_1,\ldots,x_t$

675: and the parameters of the HMM. Suppose that the prediction needs to

676: take the form of a distribution over the alphabet $\Sigma = \{0,1\}$. So far this is exactly

677: the standard framework, but suppose we differ from the standard

678: framework by considering the $L_1$ loss $1-p_t(x_t)$, where $p_t(x_t)$

679: is the predicted probability assigned to the letter that actually

680: occurred at time $t$. This replaces the standard log-likelihood

681: loss $\log(1/p_t(x_t))$. While the log loss is easier to analyze, the

682: $L_1$ loss is often a more useful measure because the cumulative $L_1$

683: loss corresponds to the expected number of mistakes. While this loss

684: does not fit well in the maximum likelihood or Bayesian methodologies,

685: it fits NormalHedge very well, because the loss per-iteration is bounded.
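
To illustrate the difference in boundedness with a small worked example (the probabilities are chosen only for illustration, and the logarithm is taken to be natural), compare the two losses for a confident correct prediction and a confident incorrect one:
\[
\begin{array}{c|c|c}
p_t(x_t) & 1 - p_t(x_t) & \log(1/p_t(x_t)) \\ \hline
0.9 & 0.1 & \approx 0.105 \\
0.01 & 0.99 & \approx 4.605
\end{array}
\]
The $L_1$ loss never exceeds $1$, whereas the log loss grows without bound as $p_t(x_t) \to 0$.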

686:

687: Here is our proposal for solving the prediction problem using

688: NormalHedge. We associate a set of experts with each hidden state. The

689: experts are confidence rated, i.e. each one of the experts outputs a

690: confidence level $0 \leq c \leq 1$ at each time step; the confidence

691: level is used in the confidence rated variant of NormalHedge described

692: in the previous section. If expert $i$ corresponds to a hidden state

693: $j$ then $c_i$ should be large when $S=j$ and low when $S \neq

694: j$. If the parameters of the HMM are known, then we can

695: associate a single expert with each hidden state and compute the

696: prediction and the confidence value of that expert using Bayes'

697: formula.
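
One way to make this computation concrete is the standard forward (filtering) recursion for HMMs. The notation $\pi_t$, $\tilde\pi_{t+1}$ and $q_j$ is introduced here only for illustration, and the choice of confidence value is an assumption about how the scheme would be instantiated. Let $\pi_t(j) = P(S_t = j \mid x_1,\ldots,x_t)$. Then, by Bayes' formula and the Markov property,
\[
\tilde\pi_{t+1}(j) = \sum_{i=1}^k P(S_{t+1} = j \mid S_t = i)\, \pi_t(i),
\qquad
\pi_{t+1}(j) = \frac{P(x_{t+1} \mid S_{t+1} = j)\, \tilde\pi_{t+1}(j)}{\sum_{l=1}^k P(x_{t+1} \mid S_{t+1} = l)\, \tilde\pi_{t+1}(l)} .
\]
Under this reading, the expert for state $j$ would predict $q_j(x) = P(X_{t+1} = x \mid S_{t+1} = j)$ with confidence $c_j = \tilde\pi_{t+1}(j)$, the probability of state $j$ at time $t+1$ given the observations so far.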

698:

699: Now suppose that we do not know the parameter vector of the HMM but

700: that we know that the vector is one of $N$ possibilities. In this case

701: we associate $N$ experts with each hidden state and compute the

702: predictions and confidence value of each expert using Bayes' formula

703: for the corresponding parameter vector; the confidence value for each

704: state is the a posteriori probability of that state.

705:

706: In this case the NormalHedge algorithm will quickly converge and give

707: most of the weight to the experts that correspond to the correct

708: parameter vector. Moreover, if none of the parameter vectors is

709: a correct description of the sequence distribution, it will converge

710: on the vector which causes the least regret, i.e. makes the smallest

711: number of mistakes.

712:

713: Contrast this with the Bayesian approach. If the true distribution

714: generating the data is not included in the set of models over which we

715: take the posterior average, and if the loss function in which we are

716: interested is not log-likelihood but rather the number of mistakes, then

717: the cumulative loss of the Bayesian average can be much larger than

718: that of the best model in the set.

719:

720: \section{Open problems}

721:

722: The most interesting open problem is to close the gap between the

723: upper bound and lower bound on the parameter $c$. We have a lower

724: bound of $c>1$ and an upper bound of $c=4$. If we consider the case

725: $\alpha \to 0$ we can reduce $c$ to $2$. However, the gap between

726: $c=1$ and $c=2$ remains.

727:

728: One promising direction of expansion is to consider the game in the

729: continuous time limit directly. This leads us naturally into

730: stochastic processes in continuous time such as Wiener

731: processes. Understanding the performance of NormalHedge in this

732: context might yield new methods for stochastic estimation and

733: stochastic control.

734:

735: %%--------------------------------------------------------------------}

736: \bibliography{bib} \bibliographystyle{alpha}

737: %%--------------------------------------------------------------------}

738:

739:

740: \appendix

741:

742: \section{Proof of main theorem}

743:

744: Recall that the cumulative discounted regret of action $i$ at time $t=j\alpha$,

745: $j \in \N$ is defined recursively by

746: \[ R_i(0) = 0, \quad R_i(t+\alpha) = (1-\alpha) R_i(t) + g_i(t) - g_A(t),

747: \]

748: where $g_i(t) \in [-\sqrt{\alpha}, +\sqrt{\alpha}]$ is the (scaled) gain of

749: action $i$ at time $t$, and $g_A(t) \in [-\sqrt{\alpha}, +\sqrt{\alpha}]$

750: is the (scaled) gain of the hedger at time $t$. We define $r_i(t) = (g_i(t)

751: - g_A(t)) / \sqrt{\alpha} \in [-2,+2]$ as the (unscaled) instantaneous

752: regret to action $i$ at time $t$. The central quantity of interest is the

753: \emph{average potential}

754: \[ \Psi(t) = \frac1N \sum_{i=1}^N \Phi(R_i(t)). \]

755: Recall that we use the definition of the potential function $\Phi$

756: in Equation~\eqref{eqn:potential} with $c = 4$.

757:

758: \begin{claim}

759: There exists a positive constant $C \leq 2.32$ such that if $\alpha <

760: 1/(800\ln(CN))$, then the average potential is always bounded from above by

761: $C$; that is, $\Psi(j\alpha) < C$ for any $j \in \N$.

762: \end{claim}

763: \begin{proof}

764: Fix $j \in \N$ and let $t = j\alpha$.

765:

766: We will analyze the change $\Psi(t+\alpha) - \Psi(t)$ by considering

767: the averages over two separate groups:

768: \[ I_1 = \{ i : R_i(t) \leq 0 \} \quad \text{and} \quad I_2 = \{ i : R_i(t)

769: > 0 \}. \]

770:

771: Let $\Psi_k(t) = (1/|I_k|) \sum_{i \in I_k} \Phi(R_i(t))$ be the average

772: potential for $I_k$, $k = 1, 2$ (assume without loss of generality that

773: neither $I_k$ is empty). We will show the following facts:

774: \begin{enumerate}

775: \item[(A):] $\Psi_1(t) = 1$ and $\Psi_1(t+\alpha) < 1 + (3/5)\alpha$;

776: \item[(B):] If $\Psi(t) < 2.32$, then $\Psi_2(t+\alpha) - \Psi_2(t) < (2/3)\alpha$;

777: \item[(C):] If $2.31 < \Psi(t) < 2.32$, then $\Psi(t+\alpha) < \Psi(t)$.

778: \end{enumerate}

779: These facts imply that the increase in average potential from $\Psi(t)$ to

780: $\Psi(t+\alpha)$ is always less than $(2/3)\alpha < 1/1200$, and that if

781: the average potential $\Psi(t)$ is strictly between $2.31$ and $2.32$,

782: then $\Psi(t+\alpha)$ is strictly less than $\Psi(t)$. The claim then

783: follows by induction because $\Psi(0) = 1$.
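
The induction step can be spelled out as follows (a sketch, assuming $\ln(CN) \geq 1$ so that $\alpha < 1/800$). Writing $p_1 = |I_1|/N$ and $p_2 = 1 - p_1$, facts (A) and (B) give
\[
\Psi(t+\alpha) - \Psi(t)
= p_1 \bigl( \Psi_1(t+\alpha) - \Psi_1(t) \bigr) + p_2 \bigl( \Psi_2(t+\alpha) - \Psi_2(t) \bigr)
< p_1 \cdot \frac35 \alpha + p_2 \cdot \frac23 \alpha
\leq \frac23 \alpha < \frac{1}{1200}.
\]
Hence if $\Psi(t) \leq 2.31$, then $\Psi(t+\alpha) < 2.31 + 1/1200 < 2.32$, while if $2.31 < \Psi(t) < 2.32$, then (C) gives $\Psi(t+\alpha) < \Psi(t) < 2.32$.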

784:

785: We now prove the facts (A), (B), and (C).

786:

787: (A): For $i \in I_1$, $\Phi(R_i(t)) = 1$ and $R_i(t+\alpha) \leq (1-\alpha)

788: R_i(t) + |r_i(t)|\sqrt{\alpha} \leq 2\sqrt{\alpha}$. Since $\Phi(x)$ is

789: non-decreasing in $x$, we have $\Phi(R_i(t+\alpha)) \leq

790: \Phi(2\sqrt{\alpha}) = e^{\alpha/2} < 1+\alpha/2+\alpha^2e^{\alpha/2}/2 <

791: 1+(3/5)\alpha$ (the last inequality follows from the upper bound on

792: $\alpha$).
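
The two numerical steps here can be checked directly (a sketch, using $\Phi(x) = e^{x^2/8}$ for $x > 0$, which is the form consistent with the derivatives used in (B), and assuming $\alpha < 1/800$):
\[
\Phi(2\sqrt{\alpha}) = e^{(2\sqrt{\alpha})^2/8} = e^{\alpha/2},
\qquad
\frac{\alpha^2 e^{\alpha/2}}{2} = \alpha \cdot \frac{\alpha\, e^{\alpha/2}}{2} < \alpha \cdot \frac{e^{1/1600}}{1600} < \frac{\alpha}{10},
\]
so that $1 + \alpha/2 + \alpha^2 e^{\alpha/2}/2 < 1 + \alpha/2 + \alpha/10 = 1 + (3/5)\alpha$.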

793:

794: (B): We address terms in $I_2$ by expanding $\Phi(R_i(t+\alpha))$ around

795: the point $R_i(t) > 0$ via Taylor's theorem:

796: \[ \Phi(R_i(t+\alpha)) = \Phi(R_i(t)) + d_i(t) \Phi'(R_i(t)) + \frac12

797: d_i(t)^2 \Phi''(\rho_i) \]

798: where $d_i(t) = r_i(t) \sqrt{\alpha} - \alpha R_i(t)$ and $\rho_i \in \R$

799: lies between $R_i(t)$ and $R_i(t+\alpha)$. Because the hedger's weights are

800: chosen so that $p_i(t) \propto \Phi'(R_i(t))$, we have that

801: \[ \sum_{i=1}^N g_i(t) \Phi'(R_i(t)) - g_A(t) \sum_{i=1}^N \Phi'(R_i(t)) =

802: 0 \]

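803: and thus the first-order terms $r_i(t)\sqrt{\alpha}\,\Phi'(R_i(t))$ cancel when summed over $i$; up to these terms, which vanish in the sum,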

804: \[ \Phi(R_i(t+\alpha)) - \Phi(R_i(t)) = -\alpha R_i(t) \Phi'(R_i(t)) +

805: \frac12 d_i(t)^2 \Phi''(\rho_i). \]

806: We need a few bounds before proceeding. First, if $\Psi(t) < 2.32$, then

807: $\Phi(R_i(t)) < 2.32N$ for all $i$, which implies $R_i(t) <

808: \sqrt{8\ln(2.32N)}$ for all $i$. By the condition on $\alpha$, we also have

809: $\sqrt{\alpha}R_i(t) < 1/10$. Next, we bound $\rho_i$, since it is the

810: argument of the non-decreasing function $\Phi''(x)$:

811: \[ \rho_i^2 \ \leq \ \max\{R_i(t), R_i(t+\alpha)\}^2 \ \leq

812: \ (R_i(t) + |r_i(t)|\sqrt{\alpha})^2

813: \ = \ R_i(t)^2 + 2\sqrt{\alpha}R_i(t)|r_i(t)| + \alpha r_i(t)^2

814: \ \leq \ R_i(t)^2 + \frac12. \]

815: Finally, we bound $d_i(t)^2$ as follows:

816: \[ d_i(t)^2 \ \leq \ (|r_i(t)|\sqrt{\alpha} + \alpha R_i(t))^2 \ \leq

817: \ (2+1/10)^2\alpha \ \leq \ \frac92 \alpha. \]

818: Altogether, we have

819: \begin{align*}

820: \Phi(R_i(t+\alpha)) - \Phi(R_i(t))

821: & = - \alpha R_i(t) \Phi'(R_i(t)) + \frac12 d_i(t)^2 \Phi''(\rho_i) \\

822: & = - \alpha R_i(t) \frac{R_i(t)}{4} e^{R_i(t)^2/8} + \frac12 d_i(t)^2

823: \left( \frac14 + \frac{\rho_i^2}{16} \right) e^{\rho_i^2/8} \\

824: & \leq - \alpha \frac{R_i(t)^2}{4} e^{R_i(t)^2/8} + \frac{9\alpha}{4}

825: \left( \frac9{32} + \frac{R_i(t)^2}{16} \right) e^{R_i(t)^2/8} e^{1/16} \\

826: & \leq \alpha \left(\frac23 - \frac1{10} R_i(t)^2\right) e^{R_i(t)^2/8}.

827: \end{align*}

828: The final bound is decreasing as a function of $R_i(t) \geq 0$. This

829: implies $\Phi(R_i(t+\alpha)) - \Phi(R_i(t)) \leq (2/3)\alpha$, so

830: $\Psi_2(t+\alpha) - \Psi_2(t) < (2/3)\alpha$.
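
As a numerical check of the constants in the last step (a sketch; the decimal values are rounded): since $e^{1/16} \approx 1.0645$, the coefficient of $R_i(t)^2 e^{R_i(t)^2/8}$ in the third line is
\[
-\frac14 + \frac94 \cdot \frac1{16} \cdot e^{1/16} \approx -0.2500 + 0.1497 = -0.1003 \leq -\frac1{10},
\]
and the constant coefficient is $\frac94 \cdot \frac9{32} \cdot e^{1/16} \approx 0.674$. This is marginally above $2/3 \approx 0.667$. Using the intermediate bound $d_i(t)^2 \leq (2.1)^2 \alpha = 4.41\alpha$ in place of the rounded $\frac92\alpha$ gives $\frac{4.41}{2} \cdot \frac9{32} \cdot e^{1/16} \approx 0.660 < \frac23$ (and a coefficient of about $-0.103$ for the quadratic term), so the stated final bound appears to follow from the unrounded estimate.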

831:

832: (C): First, consider the problem of maximizing

833: \[ f(x_1, \ldots, x_n) = \sum_{i=1}^n \left( \frac23 - \frac{x_i^2}{10}

834: \right) e^{x_i^2/8} \]

835: subject to the constraint $(1/n) \sum_{i=1}^n e^{x_i^2/8} \geq B$ for some

836: $B \geq 1$. Simple variational arguments imply that the maximum is attained

837: when $x_i = \sqrt{8\ln B}$ for all $i$. Therefore, following the argument

838: for (B), we have that if $\Psi_2(t) \geq B$ for some $B \geq 1$, then

839: \[ \Psi_2(t+\alpha) - \Psi_2(t) \leq \alpha \cdot B \cdot \left( \frac23 -

840: \frac45 \ln B \right). \]
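
The right-hand side is obtained by substituting the maximizer into the summand (spelled out here for convenience): with $x_i = \sqrt{8 \ln B}$ we have $e^{x_i^2/8} = B$ and $x_i^2/10 = (4/5)\ln B$, so
\[
\left( \frac23 - \frac{x_i^2}{10} \right) e^{x_i^2/8} = B \cdot \left( \frac23 - \frac45 \ln B \right).
\]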

841:

842: Let $p_1 = |I_1|/N$ and $p_2 = 1 - p_1$. Suppose

843: $\Psi(t) > 2.31$. Because $\Psi_1(t) = 1$, we have

844: \[ \Psi_2(t)

845: = \frac1{p_2} \left( \Psi(t) - p_1 \right) \geq \frac1{p_2} (2.31 - p_1)

846: \doteq B. \]

847:

848: Now we analyze the overall change in average potential. By (A), the

849: increase in average potential over $i \in I_1$ is less than

850: $(3/5)\alpha$. Then

851: \begin{align*}

852: \frac{\Psi(t+\alpha) - \Psi(t)}{\alpha}

853: & < p_1 \cdot \frac35 + p_2 \cdot B \cdot \left( \frac23 - \frac45 \ln B

854: \right) \\

855: & = p_1 \cdot \frac35 + (2.31 - p_1) \cdot \left( \frac23 - \frac45 \ln

856: \frac1{1-p_1} - \frac45 \ln (2.31 - p_1) \right).

857: \end{align*}

858:

859: The final RHS is decreasing as a function of $p_1 \geq 0$, so it is

860: maximized when $p_1 = 0$. Making this substitution, the RHS is negative,

861: and thus $\Psi(t+\alpha) < \Psi(t)$.
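
As a check of the last claim (a rounded numerical evaluation), at $p_1 = 0$ the right-hand side equals
\[
2.31 \cdot \left( \frac23 - \frac45 \ln 2.31 \right) \approx 2.31 \cdot (0.6667 - 0.6698) \approx -0.007 < 0 .
\]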

862: \end{proof}

863:

864:

865:

866:

867: %%\proof

868: %We define the {\em average potential}

869: %\[

870: %\Psi(t) \doteq {1 \over N} \sum_{i=1}^N \Phi \paren{\ctR{i}{t}}

871: %\]

872: %And study the time evolution of $\Psi(t)$.

873: %

874: %Clearly, $\Psi(0)=1$.

875: %

876: %We fix $\alpha$ and study the change in the average potential from

877: %$t=j\alpha$ to $t+\alpha = (j+1)\alpha$

878: %\begin{eqnarray}

879: %\lefteqn{\Psi(t+\alpha) - \Psi(t)} && \label{eqn:proof-1}\\

880: %&=&{1 \over N} \sum_i

881: %\Phi \paren{(1-\alpha)\ctR{i}{t} + \ctg{i}{t}-\ctg{A}{t} }

882: %- \Phi \paren{ \ctR{i}{t} }  \label{eqn:proof-2}

883: %\end{eqnarray}

884: %

885: %We use $\ctd{i}{t}$ to denote $\ctR{i}{t+\alpha} - \ctR{i}{t}$ and

886: %$\ctr{i}{t}$ to denote $(\ctg{i}{t}-\ctg{A}{t})/\sqrt{\alpha}$. It follows directly from the

887: %definitions that $\ctd{i}{t} = \ctr{i}{t} \sqrt{\alpha} - \ctR{i}{t} \alpha$

888: %and that $|\ctr{i}{t}| \leq 2$.

889: %

890: %We perform a Taylor expansion of Equation~\eqref{eqn:proof-2}:

891: %\begin{equation} \label{eqn:proof-3}

892: %\sum_i \Phi \paren{\ctR{i}{t+\alpha}} - \Phi \paren{\ctR{i}{t}}

893: %=\sum_i \ctd{i}{t} \Phi'(\ctR{i}{t}) + \half \ctd{i}{t}^2 \Phi''(\ctR{i}{t'})

894: %\end{equation}

895: %For some $t'$, $t \leq t' \leq t+\alpha$.

896: %

897: %We start by analyzing the first term

898: %\[

899: %\sum_i \ctd{i}{t} \Phi'(\ctR{i}{t}) =

900: %\sum_i \paren{\ctg{i}{t} - \ctg{A}{t} - \alpha \ctR{i}{t}} \Phi'(\ctR{i}{t})

901: %\]

902: %It follows from the definition of the NH algorithm that

903: %\[

904: %\ctg{A}{t} = \frac{\sum_i \ctg{i}{t} \Phi'(\ctR{i}{t})}{\sum_i \Phi'(\ctR{i}{t})}

905: %\]

906: %From which it follows that:

907: %\[

908: %\sum_i \ctd{i}{t} \Phi'(\ctR{i}{t}) = - \alpha \sum_i \ctR{i}{t} \Phi'(\ctR{i}{t})

909: %\]

910: %

911: %As $\Phi''(x)$ is an increasing function of $x$ we can bound rewrite

912: %Equation~\eqref{eqn:proof-3} as follows:

913: %\begin{eqnarray} \label{eqn:proof-4}

914: %\lefteqn{\Psi(t+\alpha) - \Psi(t)}

915: %&& \\

916: %&\leq&

917: %{1 \over N}

918: %\sum_i  \half \paren{\ctr{i}{t}\sqrt{\alpha} - \ctR{i}{t} \alpha}^2 \Phi''(\ctR{i}{t}+|\ctr{i}{t}|\sqrt{\alpha})

919: %       - \alpha \ctR{i}{t} \Phi'(\ctR{i}{t}) \nonumber

920: %\\

921: %&=&

922: %{\alpha \over N}

923: %\paren{\sum_i  \half \paren{\ctr{i}{t} - \ctR{i}{t} \sqrt{\alpha}}^2 \Phi''(\ctR{i}{t}+|\ctr{i}{t}|\sqrt{\alpha})

924: %       - \ctR{i}{t} \Phi'(\ctR{i}{t}) } \label{eqn:proof-5}

925: %\end{eqnarray}

926: %

927: %We are now ready to use the assumption that $\alpha$ is small. We do

928: %this in two steps. First, we assume $\alpha \to 0$, which makes for a

929: %simpler argument. Then we come back and show that requiring

930: %$\alpha < {1 \over 800 \ln 2N}$ is sufficient.

931: %

932: %

933: %{\bf Infinitesimally small $\alpha$}\\

934: %We divide both sides of Equation~\eqref{eqn:proof-5} by $\alpha$ and take the limit when $\alpha \to 0$ to get

935: %\begin{eqnarray}

936: %{d \over dt} \Psi(t) &\leq&

937: %{1 \over N}

938: %\paren{\sum_i  \half \ctr{i}{t}^2 \Phi''(\ctR{i}{t})

939: %       - \ctR{i}{t} \Phi'(\ctR{i}{t}) } \label{eqn:proof-6}

940: %\\

941: %&=&

942: %{1 \over N}

943: %\paren{\sum_{i; \ctR{i}{t}\geq 0}  \half \ctr{i}{t}^2

944: %               \paren{{1 \over c} +{\ctR{i}{t}^2 \over c^2}}

945: %               \exp \paren{\ctR{i}{t}^2 \over 2c}

946: %       - {\ctR{i}{t}^2 \over c}

947: %               \exp \paren{\ctR{i}{t}^2 \over 2c}

948: %}

949: %\nonumber

950: %\\

951: %&\leq&

952: %{1 \over N}

953: %\paren{\sum_{i; \ctR{i}{t}\geq 0}  2

954: %               \paren{{1 \over c} +{\ctR{i}{t}^2 \over c^2}}

955: %               \exp \paren{\ctR{i}{t}^2 \over 2c}

956: %       - {\ctR{i}{t}^2 \over c}

957: %               \exp \paren{\ctR{i}{t}^2 \over 2c}

958: %}

959: %\\

960: %& \leq &

961: %{1 \over cN}

962: %\paren{\sum_{i; \ctR{i}{t}\geq 0}

963: %               2 \exp \paren{\ctR{i}{t}^2 \over 2c}

964: %       + \paren{{2 \over c} - 1}

965: %            \ctR{i}{t}^2

966: %               \exp \paren{\ctR{i}{t}^2 \over 2c}

967: %}

968: %\\

969: %& \leq &

970: %{1 \over c}

971: %\paren{2 \Psi(t)

972: %       + {1 \over cN}\paren{{2 \over c} - 1}

973: %         \sum_i \ctR{i}{t}^2

974: %                \exp \paren{\ctR{i}{t}^2 \over 2c}

975: %}

976: %\end{eqnarray}

977: %plugging in the choice $c=4$ we get that

978: %\begin{equation}

979: %{d \over dt} \Psi(t) \leq

980: %{1 \over c}

981: %\paren{2 \Psi(t)

982: %       - {1 \over 2N}

983: %         \sum_i \ctR{i}{t}^2

984: %                \exp \paren{\ctR{i}{t}^2 \over 2c}

985: %}

986: %\end{equation}

987: %We now find a condition under which the difference on

988: %the RHS is negative. Assuming

989: %$\Psi(t)=(1/N)\sum_i \exp \paren{\ctR{i}{t}^2 \over 2c}=A$, it is easy to verify that

990: %$(1/N)\sum_i \ctR{i}{t}^2 \exp \paren{\ctR{i}{t}^2 \over 2c}$ is

991: %minimized when all of the regrets are equal, which implies that

992: %\[

993: %\ctR{i}{t} = \ctR{}{t} = \sqrt{2c\ln(A)} = \sqrt{8\ln(A)}

994: %\]

995: %If $A \geq 2$ then

996: %\[

997: %{1 \over 2N} \sum_i \ctR{i}{t}^2 \exp \paren{\ctR{i}{t}^2 \over 2c}

998: %\geq

999: %{1 \over 2} \ctR{}{t}^2 A

1000: %=

1001: %{1 \over 2}  8 \ln(A) A \geq 8 \ln(2) A > 2A

1002: %\]

1003: %Thus if $\Psi(t) \geq 2$, ${d \over dt} \Psi(t) <0$ and as

1004: %$\Psi(0)=1$ and $\Psi(t)$ is a continuous and differentiable function

1005: %of $t$, $\Psi(t) < 2$ for all $t$, which completes the proof for

1006: %$\alpha \to \infty$.

1007: %

1008: %{\bf small but finite $\alpha$}\\

1009: %We now show that requiring $\alpha < {1 \over 800 \ln 4N}$ is

1010: %sufficient. As $\alpha>0$ is fixed we can use induction over the

1011: %iterations of the game.  We divide the range $[0,4]$ into two

1012: %sub-ranges: $[0,3]$ and $(3,4)$. We show that, for every iteration $j$

1013: %\begin{enumerate}

1014: %\item If $\Psi(j \alpha) \leq 3$ then $\Psi((j+1)\alpha) <4$.

1015: %\item If $3 < \Psi(j \alpha) < 4$ then $\Psi((j+1) \alpha) < \Psi(j

1016: %  \alpha)$.

1017: %\end{enumerate}

1018: %Using induction over these two conditional statements we get that

1019: %$\Psi(j) \leq 4$ for all $j$.

1020: %

1021: %We will now prove the two conditional statements.

1022: %

1023: %We first prove an upper bound on $\ctR{i}{t}$ conditioned on the

1024: %assumption that $\Psi(t)<4$. As the potential is non-negative we get that

1025: %\[

1026: %\forall i,\;\; \Phi \paren{\ctR{i}{j\alpha}} \leq 4N

1027: %\]

1028: %From which it follows that

1029: %\[

1030: %\forall i,\;\; \ctR{i}{j \alpha} \leq \sqrt{8 \ln 4N}

1031: %\]

1032: %\newcommand{\Rm}{R_m}

1033: %We use $\Rm$ to denote $\sqrt{8 \ln 4N}$ and note that requiring

1034: %$\alpha < {1 \over 800 \ln 4N}$ implies that $\sqrt{\alpha} \ctR{i}{t}

1035: %< 1/10$.

1036: %

1037: %We now expand Equation~\eqref{eqn:proof-4} without taking the limit $\alpha \to 0$.

1038: %\begin{eqnarray}

1039: %\lefteqn{\Psi(t+\alpha) - \Psi(t)}

1040: %&& \\

1041: %&\leq&

1042: %{\alpha \over N}

1043: %\sum_i  \paren{ \half \paren{\ctr{i}{t} - \ctR{i}{t} \sqrt{\alpha}}^2 \Phi''(\ctR{i}{t}+|\ctr{i}{t}|\sqrt{\alpha})

1044: %       - \ctR{i}{t} \Phi'(\ctR{i}{t}) } \\

1045: %&\leq&

1046: %{\alpha \over N} \left[

1047: %\sum_{i; \ctR{i}{t}\geq -2\sqrt{\alpha}}  \paren{

1048: %  \half \paren{\ctr{i}{t} + \sqrt{\alpha}\ctR{i}{t}}^2

1049: %  \paren{{1 \over c} +

1050: %         {\paren{\ctR{i}{t} + \sqrt{\alpha}|\ctr{i}{t}|}^2 \over c^2}}

1051: %  \exp\paren{\paren{\ctR{i}{t} + \sqrt{\alpha}|\ctr{i}{t}|}^2\over 2c}

1052: %  } \right. \\

1053: %&& \left.

1054: %-\sum_{i; \ctR{i}{t}\geq 0} \paren{

1055: %   {\ctR{i}{t}^2 \over c} \exp \paren{\ctR{i}{t}^2 \over 2c}

1056: %  }

1057: %\right]

1058: %\nonumber

1059: %\end{eqnarray}

1060: %We first consider the terms in the first sum which have no matching

1061: %terms in the second sum. In other words, indices $i$ for which

1062: %$-2\sqrt{\alpha} \leq \ctR{i}{t} <0$. It is easy to show that because

1063: %$c=4$, $\sqrt{\alpha} \ctR{i}{t}< 1/10$, $|\ctr{i}{t}|<2$, these terms

1064: %are smaller than $1$. As for these terms $\Phi(\ctR{i}{t}=0$, the

1065: %result is that  $\Phi(\ctR{i}{t+\alpha})<1$. Thus these terms cannot

1066: %increase the overall average potential at $t+\alpha$ beyond $1$ and so

1067: %we can ignore them.

1068: %

1069: %We thus consider only terms for which $\ctR{i}{t}\geq 0$ so that there

1070: %is a term in both sums. Using the facts that $c=4$, $\sqrt{\alpha}

1071: %\ctR{i}{t}< 1/10$, $|\ctr{i}{t}|<2$ and $\alpha < 1/800$ we can show

1072: %that

1073: %\begin{eqnarray}

1074: %\lefteqn{

1075: %\sum_{i; \ctR{i}{t}\geq 0}

1076: %  \half \paren{\ctr{i}{t} + \sqrt{\alpha}\ctR{i}{t}}^2

1077: %  \paren{{1 \over c} +

1078: %         {\paren{\ctR{i}{t} + \sqrt{\alpha}|\ctr{i}{t}|}^2 \over c^2}}

1079: %  \exp\paren{\paren{\ctR{i}{t} + \sqrt{\alpha}|\ctr{i}{t}|}^2\over 2c}

1080: %- {\ctR{i}{t}^2 \over c} \exp \paren{\ctR{i}{t}^2 \over 2c}

1081: %} && \nonumber \\

1082: %& \leq &

1083: %\sum_{i; \ctR{i}{t}\geq 0}

1084: % \half (2.1)^2

1085: %\paren{{1 \over 4}+{\paren{\ctR{i}{t}+2\sqrt{\alpha}}^2 \over 16}}

1086: %\exp \paren{\paren{\ctR{i}{t}+2\sqrt{\alpha}}^2 \over 8}

1087: %-{\ctR{i}{t}^2 \over 4} \exp \paren{\ctR{i}{t}^2 \over 8}

1088: %\nonumber \\

1089: %& \leq &

1090: %\sum_{i; \ctR{i}{t}\geq 0}

1091: %\brackets{

1092: % \half (2.1)^2

1093: %\paren{{1 \over 4}+{\paren{\ctR{i}{t}+2\sqrt{\alpha}}^2 \over 16}}

1094: %\exp \paren{\ctR{i}{t}\sqrt{\alpha} + \alpha \over 2}

1095: %-{\ctR{i}{t}^2 \over 4}

1096: %}

1097: %\exp \paren{\ctR{i}{t}^2 \over 8}

1098: %\nonumber \\

1099: %& \leq &

1100: %\sum_{i; \ctR{i}{t}\geq 0}

1101: %\brackets{0.6 - 0.1 \ctR{i}{t}^2}

1102: %\exp \paren{\ctR{i}{t}^2 \over 8}

1103: %\nonumber

1104: %\end{eqnarray}

1105: %For a fixed value of $\Psi(t)$, this sum is maximized when

1106: %$\ctR{i}{t}$ are all equal to $\sqrt{\ln \Psi{t}}$. Claim 1. above

1107: %follows from the fact that the increase in the average potential in a

1108: %single step is at most $0.6$. Claim 2. follows from the fact that if

1109: %$\Psi{t} \geq 3$ then $\sqrt{\ln \Psi{t}} \geq 2.9$ and setting

1110: %$\ctR{i}{t}=2.9$ for all $i$ we find that $\Psi(t+\alpha) < \Psi(t)$.

1111: %

1112: %%\qed

1113: %

1114: \end{document}

1115:

1116: