1: \documentclass{article}
2: \usepackage[letterpaper,margin=1in]{geometry}
3: \usepackage{amsfonts}
4: \usepackage{amsmath}
5: \usepackage{amssymb}
6: \usepackage{srcltx}
7: \usepackage{graphicx}
8:
9:
10: \def\jump{\vskip 0.05in}
11: \def\qed{\vrule height 7pt width 3pt depth 0pt}
12: \newenvironment{proof}{\noindent{\it Proof.} }{\qed\jump}
13: \newenvironment{sketch}{\noindent{\it Proof sketch.} }{\qed\jump}
14: \newtheorem{theorem}{Theorem}
15: \newtheorem{claim}{Claim}
16:
17: \def\N{{\mathbb{N}}}
18: \def\R{{\mathbb{R}}}
19: \def\wh{\widehat}
20:
21: \input{macros}
22:
23: \begin{document}
24: \title{A new Hedging algorithm and its application to inferring latent random variables}
25:
26: \author{Yoav Freund and Daniel Hsu \\
27: {\tt \{yfreund,djhsu\}@cs.ucsd.edu}}
28:
29: \maketitle
30:
31: \newcommand{\vp}{\vec{p}}
32: \newcommand{\dt}{\Delta t}
33:
34: \newcommand{\ctp}[2]{p_{#1}\paren{#2}} % Continuous Time p
35: \newcommand{\ctg}[2]{g_{#1}\paren{#2}} % Continuous Time g_i
36: \newcommand{\ctga}[1]{g_A\paren{#1}} % Continuous Time g_A
37: \newcommand{\ctR}[2]{R_{#1}\paren{#2}} % Continuous Time Regret
38: \newcommand{\ctr}[2]{r_{#1}\paren{#2}}
39: \newcommand{\ctd}[2]{d_{#1}\paren{#2}}
40:
41: \begin{abstract}
42:
43: We present a new online learning algorithm for cumulative discounted
44: gain. This learning algorithm does not use exponential weights on
45: the experts. Instead, it uses a weighting scheme that depends on the
46: regret of the master algorithm relative to the experts. In
47: particular, experts whose discounted cumulative gain is smaller
48: (worse) than that of the master algorithm receive zero weight.
49: We also sketch how a regret-based algorithm can be used as an
50: alternative to Bayesian averaging in the context of inferring latent
51: random variables.
52:
53: \end{abstract}
54:
55: \section{Introduction} \label{sec:introduction}
56:
57: We study a variation on the online allocation problem presented by
58: Freund and Schapire in~\cite{FreundSc97}. Our problem varies from the
59: original in that we use {\em discounted} cumulative loss instead of
60: regular cumulative loss. Specifically, we consider the following
61: iterative game between a {\em hedger} and {\em Nature}.
62:
63: In this setting, there are $N$ actions (e.g.~strategies, experts) indexed
64: by $i$. The game between the hedger and Nature proceeds in iterations
65: $j=0,1,2,\ldots$. In the $j$th iteration:
66: \begin{enumerate}
67: \item The hedger chooses a distribution $\{p_i^j\}_{i=1}^N$ over the
68: actions, where $p_i^j \geq 0$ and $\sum_{i=1}^N p_i^j = 1$.
69: \item Nature associates a gain $g_i^j \in [-1,1]$ with action $i$.
70: \item The gain of the hedger is $g^j_A = \sum_{i=1}^N p_i^j g_i^j$.
71: \end{enumerate}
We define the {\em discounted total gain} as follows. The initial
total gain is zero, $G_i^0 = 0$. The total gain for action $i$ at the start
of iteration $j+1$ is defined inductively as:
75: \[
76: G_i^{j+1} \doteq (1 - \alpha) G_i^j + g_i^j
77: \]
78: for some fixed {\em discount factor} $\alpha>0$.
The discounted total gain of the hedger is similarly
defined:
81: \[
82: G_A^0=0, \quad G_A^{j+1} \doteq (1 - \alpha) G_A^j + g_A^j~.
83: \]
84: We define the {\em regret} of the hedger with respect to action $i$ at the
85: start of iteration $j$ as
86: \[
R_i^j \doteq G_i^j - G_A^j~.
88: \]
89: It is easy to see that the regret obeys the following recursion:
90: \[
91: R_i^0=0, \quad R_i^{j+1} = (1 - \alpha) R_i^j + g_i^j - g_A^j~.
92: \]
93: Our goal is to find a hedging algorithm for which we can show a small uniform
upper bound on the regret, i.e.~a small positive real number
95: $B(\alpha)$ such that $R_i^j \leq B(\alpha)$ for all choices of Nature, all
96: $i$ and all $j$.
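
As a quick sanity check on the scale of the problem, the recursion can be
unrolled. The following is a sketch that assumes $0 < \alpha \leq 1$, a
restriction not stated explicitly above:
\begin{align*}
R_i^j &= \sum_{s=0}^{j-1} (1-\alpha)^{j-1-s} \left( g_i^s - g_A^s \right), \\
|R_i^j| &\leq 2 \sum_{k=0}^{j-1} (1-\alpha)^{k} \leq \frac{2}{\alpha},
\end{align*}
using $|g_i^s - g_A^s| \leq 2$. Thus any hedging algorithm trivially has
regret at most $2/\alpha$, and a bound $B(\alpha)$ is only interesting when
it is much smaller than this, for instance of order $\sqrt{(\ln N)/\alpha}$.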
97:
98: Our new hedging algorithm, which we call {\bf NormalHedge}, uses the
99: following weighting:
100: \begin{equation} \label{eqn:Hedge-distribution}
101: w_i^j \doteq
102: \begin{cases}
103: R_i^j \exp \paren{\frac{\alpha \brackets{R_i^j}^2}{8}} & \text{if $R_i^j >
104: 0$} \\
105: 0 & \text{if $R_i^j \leq 0$.}
106: \end{cases}
107: \end{equation}
108: The hedging distribution is equal to the normalized weights
109: $p_i^j = w_i^j / \sum_{k=1}^N w_k^j$ unless all of the weights
110: are zero, in which case we use the uniform distribution $p_i^j = 1/N$.
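
Putting the pieces together, one iteration of NormalHedge can be summarized
as the following sketch (a restatement of the definitions above, not
additional machinery):
\begin{enumerate}
\item Compute $w_i^j$ from the current regrets $R_i^j$ using
Equation~\eqref{eqn:Hedge-distribution}.
\item Set $p_i^j = w_i^j / \sum_{k=1}^N w_k^j$, or $p_i^j = 1/N$ if all
weights are zero.
\item Observe the gains $g_i^j$ and compute $g_A^j = \sum_{i=1}^N p_i^j g_i^j$.
\item Update $R_i^{j+1} = (1-\alpha) R_i^j + g_i^j - g_A^j$ for each $i$.
\end{enumerate}
For illustration, take the hypothetical values $N = 3$, $\alpha = 0.01$ and
regrets $(R_1^j, R_2^j, R_3^j) = (20, 10, -5)$. Then
\[
w_1^j = 20\, e^{0.5} \approx 32.97, \quad
w_2^j = 10\, e^{0.125} \approx 11.33, \quad
w_3^j = 0,
\]
so that $p_1^j \approx 0.744$, $p_2^j \approx 0.256$ and $p_3^j = 0$.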
111:
112: Our main result is that if $\alpha$ is sufficiently small, the
113: following inequality holds uniformly over all game histories:
114: \begin{equation*}
115: {1 \over N} \sum_{i=1}^N \Phi\paren{\sqrt{\alpha} R_i^j} < 2.32
116: \end{equation*}
117: where
118: \[
119: \Phi\paren{x} =
120: \begin{cases}
121: \exp \paren{\frac{x^2}{8}} & \text{if $x > 0$} \\
122: 1 & \text{if $x \leq 0$.}
123: \end{cases}
124: \]
125: This implies, in particular, that for any $i$ and $j$,
126: \[
R_i^j \leq \sqrt{\frac{8 \ln (2.32N)}{\alpha}}.
128: \]
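
To see how the regret bound follows, here is the short calculation. Fix $i$
and $j$ with $R_i^j > 0$ (otherwise there is nothing to prove). Since every
term of the average potential is at least $1$, hence positive,
\[
\exp\left( \frac{\alpha (R_i^j)^2}{8} \right)
= \Phi\left(\sqrt{\alpha} R_i^j\right)
\leq \sum_{k=1}^N \Phi\left(\sqrt{\alpha} R_k^j\right)
< 2.32 N,
\]
and taking logarithms and solving for $R_i^j$ gives
$R_i^j < \sqrt{8 \ln(2.32 N)/\alpha}$.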
129:
The inverse discount factor $1/\alpha$ plays a role similar to the number of
iterations in the standard undiscounted cumulative loss framework.
132: %In
133: %order to compare this results to the results in~\cite{FreundSc97} we
134: %set $\alpha = 1/T$ and get that (for sufficiently large $T$) $R_i^j
135: %\leq \sqrt{8 T \ln 4N}$. This is very similar to the bound $R_i^T \leq
136: %\sqrt{2T \ln N}+\ln N$ on the loss of the exponential weights {\bf
137: % Hedge} algorithm given in~\cite{FreundSc97} (Equation 11). The
138: %important difference is that the bounds for NormalHedge hold for any
139: %step $j$ while the bound for Hedge holds only for $j=T$. This is a
140: %significant improvement of NormalHedge over Hedge. In order to set the
141: %learning rate parameter $\beta$ in Hedge we need an a-priori upper
142: %bound on the total loss of the best expert. In NormalHedge the
143: %learning rate $\alpha$ is the discount factor we wish to use, and
144: %setting it does not require an a-priori upper bound on the total
145: %(discounted) loss of the best expert.
146: Indeed, it is easy to transform the usual exponential weights algorithms
147: from the standard framework (e.g.~Hedge \cite{FreundSc97}) to our
148: present setting (Section~\ref{sec:comparison}). Such algorithms also
149: enjoy discounted cumulative regret bounds of
150: \[ R_i^j \leq C \cdot \sqrt{\frac{\ln N}{\alpha}} \]
151: for some positive constant $C$, but they require knowledge of the number of
152: actions $N$ to tune a learning parameter. The tuning of NormalHedge does
153: not have this requirement\footnote{The guarantees afforded to NormalHedge
154: require $\alpha$ to be sufficiently smaller than $1/\ln N$, but this
155: restriction is operationally different from needing to know $N$ in
156: advance.}.
157:
158: The rest of this paper is organized as follows. In
159: Section~\ref{sec:drifting-game} we describe the main ideas behind the
160: construction and analysis of NormalHedge. In
161: Sections~\ref{sec:comparison} and \ref{sec:hedge} we discuss related work and compare NormalHedge to exponential
162: weights algorithms. Finally, in Section~\ref{sec:latent} we suggest
163: how to use NormalHedge to track latent variables and sketch how that
164: might be used for learning HMMs under the $L_1$ loss.
165:
166: \section{NormalHedge} \label{sec:drifting-game}
167:
168: \subsection{Preliminaries}
169:
170: NormalHedge and its analysis are based on the potential function
171: $\Phi(x)$ introduced in Section~\ref{sec:introduction}. Here
172: we give a slightly more elaborate definition for $\Phi(x)$ that includes a
173: constant $c$. The potential function is a %twice-differentiable
174: non-decreasing function of $x \in \R$
175: \begin{equation} \label{eqn:potential}
176: \Phi(x) \doteq \begin{cases}e^{x^2/2c} & \text{if $x > 0$} \\
177: 1 & \text{if $x \leq 0$}
178: \end{cases}
179: \end{equation}
180: where $c > 1$. In our current version of NormalHedge, $c=4$. Decreasing
181: $c$ will improve the bound on the regret; we will also argue that $c$
182: cannot be decreased to $1$.
183:
The weights assigned by NormalHedge are set proportional to the first
derivative of $\Phi$, i.e.~$w_i^j \propto \Phi'(\sqrt{\alpha} R_i^j)$, where
186: $$ \Phi'(x) = \begin{cases}{x \over c}e^{x^2/2c} & \text{if $x > 0$} \\
187: 0 & \text{if $x \leq 0$.}
188: \end{cases} $$
189: In our analysis, we will also need to examine the second derivative of
190: $\Phi$:
191: $$ \Phi''(x) = \begin{cases}\paren{{1 \over c} + {x^2 \over c^2}}e^{x^2/2c}
192: & \text{if $x > 0$} \\
193: 0 & \text{if $x < 0$.}
194: \end{cases} $$
195: Note that $\Phi''(x)$ has a discontinuity at $x=0$.
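
For concreteness, the derivatives and the behavior at the origin can be
checked directly. For $x > 0$,
\[
\frac{d}{dx} e^{x^2/2c} = \frac{x}{c} e^{x^2/2c},
\qquad
\frac{d}{dx} \left( \frac{x}{c} e^{x^2/2c} \right)
= \left( \frac1c + \frac{x^2}{c^2} \right) e^{x^2/2c},
\]
so that $\lim_{x \downarrow 0} \Phi'(x) = 0$ while
$\lim_{x \downarrow 0} \Phi''(x) = 1/c$. Hence $\Phi$ and $\Phi'$ are
continuous at $x=0$, and $\Phi''$ jumps from $0$ to $1/c$ there.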
196:
197: \subsection{An intuitive derivation}
198:
199: The intuition behind the potential function is based on considering
200: the following strategy for Nature. Suppose there are two types of
201: actions, {\em good} actions and {\em poor} actions. The gain for each
202: action on each iteration is chosen independently at random from a
203: distribution over $\{-1,+1\}$. The distribution for poor actions has
204: equal probabilities $1/2,1/2$ on the two outcomes, while the
distribution for the good actions is $(1+\gamma)/2$ on $+1$ and
$(1-\gamma)/2$ on $-1$ for some very small $\gamma>0$. Clearly, the
best hedging strategy is to put equal positive weights on the good
actions and zero weight on the poor actions. Unfortunately, the
hedging algorithm does not know at the beginning of the game which
actions are good, so it has to learn these weights online. Assuming
211: that the number of actions is infinite (or sufficiently large), the
212: per-iteration gain of the optimal weighting is $\gamma$, which implies
213: that the discounted cumulative gain of this strategy is $\gamma/\alpha$.
214:
215: Consider the regrets of this optimal hedging with respect to the good
216: actions. It is not hard to show that the expected value of the
217: discounted cumulative gain of a good action is $\gamma/\alpha$ and
218: that the variance is approximately $1/\alpha$ (becomes exact as
219: $\gamma \to 0$). Moreover, if $\alpha \to 0$ this distribution
220: approaches a {\em normal} distribution with mean $\gamma/\alpha$ and
221: variance $1/\alpha$. In other words, the distribution of the regrets
222: of optimal hedging with respect to the good actions is
223: $(1/Z)\exp(-\alpha R^2/2)$.
224:
225: Consider the expected value of the potential function
226: $\Phi(\sqrt{\alpha}R)$ for this distribution over the regrets. If we
227: set $c=1$ we find that the product of the probability of the regret
228: $R$ and the potential for the regret $R$ is a constant independent of
229: $R$:
230: $$ \frac1Z \cdot \exp\left(-\frac{\alpha R^2}{2} \right) \cdot \exp\left(\frac{\alpha
231: R^2}{2} \right) = \frac1Z = \Omega(1). $$
232: Thus the expected potential is infinite. However, if we set $c$ to be
233: larger than $1$ then the expected value of the potential function becomes
234: finite. Thus, roughly speaking, the potential associated with a regret
235: value is the reciprocal of the probability of that regret value being a
236: result of random fluctuations. This level of regret is unavoidable. The
237: design of NormalHedge is based on the goal of not allowing the average
238: regret to grow beyond this level that is generated by random fluctuations.
239: Ideally, we would be able to use a potential function with any constant $c$
240: larger than 1. However, what we are able to prove is that the algorithm
241: works for $c=4$.
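
The role of the threshold $c=1$ can be made explicit with a Gaussian
integral. As a simplified illustration (an assumption made only for this
calculation), suppose the scaled regret $x = \sqrt{\alpha} R$ is a standard
normal random variable. Then for $c > 1$,
\[
\mathbb{E}\left[\Phi(x)\right]
= \frac12 + \frac{1}{\sqrt{2\pi}} \int_0^\infty
e^{-\left(1 - \frac1c\right) x^2/2} \, dx
= \frac12 + \frac12 \sqrt{\frac{c}{c-1}},
\]
which is approximately $1.08$ for $c=4$ and $1.21$ for $c=2$, and diverges
as $c \downarrow 1$.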
242:
243: The idea of NormalHedge is to keep the average potential small. It is
244: therefore natural that the weight assigned to each action is proportional
245: to the derivative of the potential. Indeed, it is easily checked that the
246: weights $w_i^j$ defined in Equation~\eqref{eqn:Hedge-distribution} are
247: proportional to $\Phi'(\sqrt{\alpha}R_i^j)$.
248: This derivative, however, is best viewed when the hedging game is mapped
249: into continuous time.
250: %However, in order to make this
251: %use of a derivative sensible, we need to map the hedging game into
252: %continuous time.
253:
254: \subsection{The continuous time limit}
255: Our analysis of NormalHedge is based on mapping the integer time steps
256: $j=0,1,2,\ldots$ into real-valued time steps $t=0,\alpha,2\alpha,\ldots$
257: and then taking the limit $\alpha \to 0$. Formally, we redefine
258: the hedging game using a different notation which uses the real valued
259: time $t$ instead of the time index $j$. We assume a set of $N$
260: actions (experts), indexed by $i$. The game between the hedging
261: algorithm and Nature proceeds in iterations
$t=0,\alpha,2\alpha,\ldots$. At each iteration the
following sequence of actions takes place.
264:
265: \begin{enumerate}
\item The hedging algorithm defines a distribution $\braces{\ctp{i}{t}}_{i=1}^N$ over the
actions, where $\ctp{i}{t} \geq 0$ and $\sum_{i=1}^N \ctp{i}{t} = 1$.
268: \item Nature associates a gain $\ctg{i}{t} \in [-\sqrt{\alpha},+\sqrt{\alpha}]$ with action $i$.
269: \item The gain of the hedger is $\ctga{t} = \sum_{i=1}^N \ctp{i}{t} \ctg{i}{t}$.
270: \end{enumerate}
271:
272: We skip the definitions of $G_i(t)$ and $G_A(t)$ as these can become
273: ill-behaved when $\alpha \to 0$. Instead we define the regret directly:
274: \[
275: \ctR{i}{0}=0,\;\; \ctR{i}{t+\alpha} = (1- \alpha)\ctR{i}{t} + \ctg{i}{t} - \ctga{t}~.
276: \]
277: Note that this definition of the regret is a scaled version of the
278: discrete time regret:
279: \[
280: \ctR{i}{j\alpha} = \sqrt{\alpha} R_i^j.
281: \]
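
This identity can be verified by induction on $j$, under the natural
identification $\ctg{i}{j\alpha} = \sqrt{\alpha}\, g_i^j$ with the same
hedging distributions in both games (so that
$\ctga{j\alpha} = \sqrt{\alpha}\, g_A^j$), which we assume here. The case
$j=0$ is immediate, and if $\ctR{i}{j\alpha} = \sqrt{\alpha} R_i^j$ then
\begin{align*}
\ctR{i}{(j+1)\alpha}
&= (1-\alpha) \ctR{i}{j\alpha} + \ctg{i}{j\alpha} - \ctga{j\alpha} \\
&= \sqrt{\alpha} \left( (1-\alpha) R_i^j + g_i^j - g_A^j \right)
= \sqrt{\alpha} R_i^{j+1}.
\end{align*}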
282:
283: We now have the tools needed to prove our main result.
284: \begin{theorem} \label{thm:main}
285: There exists a positive constant $C < 2.32$ such that if $\alpha < 1/(800
286: \ln CN)$, then for any sequence of gains and any iteration $j$
287: $$ \frac1N \sum_{i=1}^N \Phi\paren{\sqrt{\alpha} R_i^j} < C. $$
288: \end{theorem}
289: %The proof of the theorem is given in the appendix.
290: %
291: \begin{sketch}
292: The full proof is given in the appendix, but here we sketch a
293: continuous-time argument (i.e.~we consider $\alpha \to 0$). The formal,
294: discrete-time proof shows that it is enough for $\alpha \leq 1/(800 \ln
295: CN)$.
296:
297: We want to show that the average potential
\[ \Psi(t) \doteq \frac1N \sum_{i=1}^N \Phi(\ctR{i}{t}) \]
299: is bounded for all time $t$. Our approach is to show that its
300: time-derivative
301: \[ \frac{\partial}{\partial t} \Psi(t) = \lim_{\alpha \to 0} \frac1\alpha
302: \cdot \frac1N \sum_{i=1}^N \left\{ \Phi(\ctR{i}{t+\alpha}) -
303: \Phi(\ctR{i}{t}) \right\} \]
304: becomes non-positive as soon as $\Psi(t)$ is above some constant (recall
that the time steps are in increments of $\alpha$). Since $\Phi(x)$ is
constant for $x \leq 0$, we need only consider $i$ such that
307: $\ctR{i}{t+\alpha} \geq 0$. Ignoring the discontinuity of $\Phi''(x)$ at
308: $x=0$, Taylor's theorem implies that for some $\rho_i \leq
309: \max\{\ctR{i}{t}, \ctR{i}{t+\alpha}\}$,
310: \begin{eqnarray*}
311: \sum_{i: \ctR{i}{t} \geq 0}
312: \Phi(\ctR{i}{t+\alpha}) - \Phi(\ctR{i}{t})
313: & = & \sum_{i: \ctR{i}{t} \geq 0} \Phi((1-\alpha) \ctR{i}{t} + g_i(t) - g_A(t)) - \Phi(\ctR{i}{t}) \\
314: & = & \sum_{i: \ctR{i}{t} \geq 0} (-\alpha \ctR{i}{t} + g_i(t) - g_A(t))
315: \Phi'(\ctR{i}{t}) \\
316: & & \quad \quad \quad \mbox{} + \frac12 (g_i(t) - g_A(t) - \alpha \ctR{i}{t})^2 \Phi''(\rho_i) \\
317: & \leq & \sum_{i: \ctR{i}{t} \geq 0} -\alpha \ctR{i}{t} \Phi'(\ctR{i}{t}) + \frac12 (g_i(t) - g_A(t) -
318: \alpha \ctR{i}{t})^2 \Phi''(\rho_i) \\
319: & \leq & \sum_{i: \ctR{i}{t} \geq 0} -\alpha \ctR{i}{t} \Phi'(\ctR{i}{t}) +
320: \frac12 (2\sqrt{\alpha} + \alpha
321: \ctR{i}{t})^2 \Phi''(\ctR{i}{t} + 2\sqrt{\alpha}).
322: \end{eqnarray*}
323: The first inequality uses the fact that the weights are proportional to the
324: derivatives of the potentials
325: $$ \sum_{i: \ctR{i}{t} \geq 0} g_i(t) \cdot
326: \frac{\Phi'(\ctR{i}{t})}{\sum_{j: \ctR{j}{t} \geq 0} \Phi'(\ctR{j}{t})} =
327: g_A(t), $$
328: and the second inequality follows because $|g_i(t) - g_A(t)| \leq
329: 2\sqrt{\alpha}$. Now dividing by $\alpha$ and $N$ and taking the limit
330: $\alpha \to 0$, we have
331: \begin{eqnarray*}
332: \frac{\partial}{\partial t} \Psi(t)
333: & \leq & \lim_{\alpha \to 0} \frac1\alpha \cdot \frac1N \sum_{i: \ctR{i}{t}
334: \geq 0} \frac12 (2\sqrt{\alpha} + \alpha \ctR{i}{t})^2 \Phi''(\ctR{i}{t} +
335: 2\sqrt{\alpha}) -\alpha \ctR{i}{t} \Phi'(\ctR{i}{t}) \\
336: & = & \lim_{\alpha \to 0} \frac1N \sum_{i: \ctR{i}{t}
337: \geq 0} \frac12 (2 + \sqrt{\alpha}\ctR{i}{t})^2 \Phi''(\ctR{i}{t} +
338: 2\sqrt{\alpha}) - \ctR{i}{t} \Phi'(\ctR{i}{t}) \\
& = & \frac1N \sum_{i: \ctR{i}{t} \geq 0} \left\{ 2 \left( \frac1c + \frac{\ctR{i}{t}^2}{c^2} \right)
\exp(\ctR{i}{t}^2/2c) - \frac{\ctR{i}{t}^2}{c} \exp(\ctR{i}{t}^2/2c) \right\}
341: \\
342: & \leq & \frac2c \Psi(t) + \frac1{cN} \sum_{i=1}^N \left( \frac2c - 1 \right)
343: \ctR{i}{t}^2 \exp(\ctR{i}{t}^2/2c).
344: \end{eqnarray*}
345: If $\Psi(t) \geq B$, then this final RHS is maximized when $R_i(t) \equiv
346: \sqrt{2c\ln B}$ for all $i$, whereupon
347: $$ \frac{\partial}{\partial t} \Psi(t) \leq \frac{2B}{c} \left( 1 + \left(
348: \frac2c - 1\right) c \ln B \right). $$
349: This is non-positive for sufficiently large $B$ and $c \geq 2 + 1/\ln B$.
350: \end{sketch}
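
To give a feel for the final condition in the sketch, one can solve
$c \geq 2 + 1/\ln B$ for $B$; this is only the heuristic continuous-time
calculation, not the constant of the formal proof. For $c > 2$,
\[
1 + (2 - c) \ln B \leq 0
\quad \Longleftrightarrow \quad
B \geq \exp\left( \frac{1}{c-2} \right),
\]
so with $c = 4$ the heuristic argument makes the time-derivative of $\Psi$
non-positive whenever $\Psi(t) \geq e^{1/2} \approx 1.65$.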
351:
352: \section{Related work} \label{sec:comparison}
353:
354: \subsection{Relation to other online learning algorithms}
355:
The Hedge algorithm~\cite{FreundSc97}, as well as most of the work on
online learning algorithms, is based on exponential weighting, where
the weight assigned to an expert is exponential in the cumulative loss
of that expert. NormalHedge uses a very different weighting
scheme. The most important difference is that the weight of an expert
depends on the regret of the master algorithm relative to that expert,
rather than just on the loss of that expert. In particular, experts
363: whose discounted cumulative loss is larger than that of the master
364: algorithm receive zero weight. We expand on the comparison of NormalHedge
365: to Hedge in Section~\ref{sec:hedge}.
366:
The starting point for the derivation and analysis of NormalHedge is
the Binomial Weights algorithm of Cesa-Bianchi et
al.~\cite{CesabianchiFrHeWa96}. The Binomial Weights algorithm is an
algorithm for a restricted version of the experts prediction
problem~\cite{LittlestoneWa94,CesabianchiFrHeHaScWa97}. In this version,
the sequence to be predicted is binary and all of the predictions are also
binary. The Binomial Weights algorithm is analyzed using a type of
{\em chip game}. In this game each expert is represented as a chip, and at
each iteration each chip has a location on the integer line. The
position of the chip corresponds to the number of mistakes that were
made by the expert. The a-priori assumption is that there is at least
one expert that makes at most $k$ mistakes, and the goal is to
define a rule for combining the experts' predictions in a way that
minimizes the maximal number of mistakes of the master algorithm.
381:
The chip game analysis leads naturally to the definition of the {\em
potential function}, and the evolution of this potential function
from iteration to iteration yields the Binomial Weights algorithm. A
385: closely related notion of potential was used in the Boost-by-Majority
386: algorithm. The chip-game analysis was extended by Schapire's work on
387: drifting games~\cite{Schapire01} and by Freund and Opper's work on drifting
388: games in continuous time~\cite{FreundOp02}. NormalHedge naturally extends
389: the continuous time drifting games to a setting in which one seeks to
390: minimize discounted loss.
391:
392: \subsection{Relation to switching and sleeping experts}
393:
394: The use of discounted cumulative loss represents an alternative to the
``switching experts'' framework of Herbster and
Warmuth~\cite{HerbsterWa98}. If the best expert changes at a rate of
397: $O(\alpha)$, then NormalHedge
398: will switch to the new best expert because the losses that occurred more
399: than $1/\alpha$ iterations ago make a small contribution to the discounted
400: total loss.
401:
A useful extension of NormalHedge is to use experts that can
abstain, similar to the setup studied in~\cite{FreundScSiWa97}. To do this we assume
that each expert $i$, at each iteration $j$, outputs a confidence
level $0 \leq c_i^j \leq 1$. Instead of using the vector
$\{p_i^j\}_{i=1}^N$ the hedger uses the vector $\{p_i^j
c_i^j/Z^j\}_{i=1}^N$ where $Z^j = \sum_{i=1}^N p_i^j c_i^j$. The gain
$g_i^j$ of action $i$ at iteration $j$ is replaced by $c_i^jg_i^j$,
and the discounted cumulative gain and the discounted cumulative
regret change in the corresponding way. The bounds on the average
potential transfer without change. This allows an expert to abstain
from making a prediction. By setting $c_i^j=0$ the expert effectively
removes itself from the pool of experts used by the hedger. It also
avoids suffering any loss. However, an expert cannot always abstain,
because then its discounted cumulative gain will be driven to zero by
the discount factor.
We will use this extension in Section~\ref{sec:latent}.
418:
419: \section{Comparison of NormalHedge and Hedge} \label{sec:hedge}
420:
421: \subsection{Discounted regret bound for Hedge}
422:
423: To ease the comparison, we first recast the Hedge algorithm
424: \cite{FreundSc97} into our current framework with discounted gains. The
425: weights used by Hedge are
426: $$ w_i^j \doteq \exp(\eta G_i^j) $$
427: where $G_i^j$ is the discounted cumulative gain of action $i$ at the
428: start of iteration $j$, and $\eta > 0$ is the learning rate parameter.
429: When written recursively as
430: $$ w_i^{j+1} = \exp\left(\eta ((1-\alpha) G_i^j + g_i^j) \right)
431: \propto \left( w_i^j \right)^{1-\alpha} \exp(\eta g_i^j), $$
432: we see that the effect of discounting is a dampening of the previous
433: weights $w_i^j$ prior to the usual multiplicative update rule.
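
The recursion can be checked in one line by splitting the exponent:
\[
\exp\left( \eta \left( (1-\alpha) G_i^j + g_i^j \right) \right)
= \left( e^{\eta G_i^j} \right)^{1-\alpha} e^{\eta g_i^j}
= \left( w_i^j \right)^{1-\alpha} e^{\eta g_i^j}.
\]
Formally setting $\alpha = 0$ recovers the usual multiplicative update
$w_i^{j+1} = w_i^j e^{\eta g_i^j}$.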
434:
435: Fix any iteration $j$ and define the adjusted cumulative gain of action $i$
436: at the start of iteration $k$ to be
437: \[ \wh G_i^k = \sum_{s=1}^{k-1} (1-\alpha)^{j-1-s} g_i^s \]
with $\wh G_i^0 = 0$. The gain of Hedge in iteration $k$ is
439: \[ g_A^k =
440: \frac{\sum_{i=1}^N w_i^k g_i^k}{\sum_{i=1}^N w_i^k}
441: = \frac{\sum_{i=1}^N e^{\eta \wh G_i^k}
442: g_i^k}{\sum_{i=1}^N e^{\eta \wh G_i^k}}
443: \]
444: and the adjusted cumulative gain of Hedge at the start of iteration $k$ is
445: \[ \wh G_A^k = \sum_{s=1}^{k-1} (1-\alpha)^{j-1-s} g_A^s. \]
446: Then the discounted cumulative regret to action $i$ at the start of
447: iteration $j$ is $\wh G_i^j - \wh G_A^j$.
448:
We analyze the (log of the) ratios $W_{k+1} / W_k$, where
\[ W_k = \sum_{i=1}^N e^{\eta \wh G_i^k} \]
and $W_0 = W_1 = N$ (the sums defining $\wh G_i^0$ and $\wh G_i^1$ are
empty). We lower bound $\ln(W_j / W_0)$ as
\[ \ln \frac{W_j}{W_0} = \ln \sum_{i=1}^N e^{\eta \wh G_i^j} - \ln N \geq
\ln e^{\eta \wh G_i^j} - \ln N = \eta \wh G_i^j - \ln N \]
(for any $i$), and we upper bound it as
\begin{align*}
\ln \frac{W_j}{W_0}
& = \sum_{k=1}^{j-1} \ln \frac{W_{k+1}}{W_k} \\
& = \sum_{k=1}^{j-1} \ln \frac{\sum_{i=1}^N e^{\eta \wh G_i^k} e^{\eta
(1-\alpha)^{j-1-k} g_i^k}}{\sum_{i=1}^N e^{\eta \wh G_i^k}} \\
& \leq \sum_{k=1}^{j-1} \left( \eta \cdot \frac{\sum_{i=1}^N e^{\eta \wh G_i^k}
(1-\alpha)^{j-1-k} g_i^k}{\sum_{i=1}^N e^{\eta \wh G_i^k}} +
\frac{\eta^2}{8} \cdot 4(1-\alpha)^{2(j-1-k)} \right) \quad \text{(Hoeffding's
inequality)} \\
& = \sum_{k=1}^{j-1} \left( \eta (1-\alpha)^{j-1-k} g_A^k + \frac{\eta^2}{2}
(1-\alpha)^{2(j-1-k)} \right) \\
& \leq \eta \wh G_A^j + \frac{\eta^2}{2} \cdot \frac{1}{1-(1-\alpha)^2} \\
& = \eta \wh G_A^j + \frac{\eta^2}{4(\alpha - \alpha^2/2)}.
\end{align*}
469: Therefore, the discounted cumulative regret of Hedge to action $i$ at the
470: start of any iteration $j$ is
471: $$ R_i^j = \wh G_i^j - \wh G_A^j \leq \frac{\ln N}{\eta}
472: + \frac{\eta}{4(\alpha - \alpha^2/2)}. $$
473: Choosing $\eta = \sqrt{4(\alpha - \alpha^2/2) \ln N}$ gives
474: $$ R_i^j \leq \sqrt{\frac{\ln N}{\alpha - \alpha^2/2}}. $$
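
The choice of $\eta$ comes from minimizing the right-hand side of the bound.
Writing $a = \alpha - \alpha^2/2$ as shorthand (notation introduced only for
this calculation),
\[
\frac{d}{d\eta} \left( \frac{\ln N}{\eta} + \frac{\eta}{4a} \right)
= -\frac{\ln N}{\eta^2} + \frac{1}{4a} = 0
\quad \Longrightarrow \quad
\eta = \sqrt{4 a \ln N},
\]
at which point both terms equal $\frac12 \sqrt{(\ln N)/a}$ and their sum is
$\sqrt{(\ln N)/a}$.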
475:
476: The regret bound is of the same form as that implied by
477: Theorem~\ref{thm:main}, indeed, with better leading constants. However,
478: this bound only holds when $\eta$ is tuned with knowledge of the number of
479: actions $N$. If instead one sets $\eta = \Theta(\sqrt{\alpha})$
480: independently of $N$, the bound for Hedge is worse by a factor of
481: $\Theta(\sqrt{\ln N})$. Furthermore, this setting of $\eta$ is for
482: optimizing a bound that anticipates the worst-case sequence of gains; when
483: Nature is not optimally adversarial, then a proper setting of $\eta$ may
484: require other prior knowledge.
485:
486: \subsection{Simulations}
487:
488: \subsubsection{The effect of good experts}
489:
490: To empirically compare Hedge and NormalHedge, we first simulated the two
491: algorithms in a scenario similar to that described in
492: Section~\ref{sec:drifting-game}:
493: \begin{itemize}
494: \item The number of experts is $N = 1000$, and the discount parameter is
495: $\alpha = 0.001$.
496: \item At any given time, there is a set of $N_G = f \cdot N$ good experts
497: and $N - N_G$ bad experts. (We varied $f \in \{ 0.001, 0.01, 0.1, 0.5 \}$.)
498: \begin{itemize}
499: \item With probability $0.5 + \gamma/2$, \emph{every} good expert
500: receives gain $+1$; with probability $0.5 - \gamma/2$, \emph{every} good
501: expert receives gain $-1$. (We varied $\gamma \in \{ 0.2, 0.4, 0.6, 0.8
502: \}$.)
503: \item Bad experts receive gain $+1$ and $-1$ with equal probability.
504: \end{itemize}
505: \item Initially, the set of good experts is $\{ 0, 1, \ldots, N_G-1 \}$.
506: \item After every $1/\alpha$ iterations, the set of good experts shifts
507: from $\{ i_0, i_0 + 1, \ldots, i_0 + N_G-1 \}$ to $\{ i_0 + N_G, i_0 +
508: N_G+1, \ldots, i_0 + 2N_G-1 \}$ (with addition modulo $N$).
509: \end{itemize}
510: Thus, the set of good experts completely changes every $1/\alpha$
511: iterations. In each iteration, all good experts receive the same gain,
512: which is $\gamma$ in expectation. In contrast, the gain of each bad expert
513: is decided independently with a fair coin.
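
As a rough guide to the time scale involved, consider the expected
discounted gain of a good expert that stays good from iteration $0$ onward;
this is a back-of-the-envelope sketch that ignores the shifts. Each
iteration contributes expected gain
$(0.5 + \gamma/2) - (0.5 - \gamma/2) = \gamma$, so
\[
\mathbb{E}\left[ G_i^j \right]
= \gamma \sum_{k=0}^{j-1} (1-\alpha)^k
= \frac{\gamma}{\alpha} \left( 1 - (1-\alpha)^j \right).
\]
With $\alpha = 0.001$ and $j = 1/\alpha = 1000$, we have
$(1-\alpha)^j \approx e^{-1} \approx 0.37$, so a good expert accumulates
roughly $63\%$ of its limiting gain $\gamma/\alpha$ before the set of good
experts shifts.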
514:
515: We tuned the learning rate parameter for Hedge to $\eta = \sqrt{(\alpha -
516: \alpha^2/2) \ln N}$. For NormalHedge, we varied $c \in \{ 1, 2, 4\}$.
517: Recall that the regret bound we can show for NormalHedge holds for $c = 2$
518: as $\alpha \to 0$ (the formal proof is stated with $c = 4$).
519:
520: Figures~\ref{fig:sim1-1} and \ref{fig:sim1-2} depict the discounted
521: cumulative regret to the best expert (averaged over $50$ runs). First, we
522: observe that NormalHedge fares better than Hedge when the advantage of the
523: good experts is large and the fraction of experts that are good is large.
524: In such cases, the advantage of NormalHedge is especially pronounced within
525: $1/\alpha$ iterations (before the set of good experts shifts). Second, we
526: observe that the performance of NormalHedge generally improves as the value
527: of $c$ is decreased. Indeed, the setting of $c = 1$ (for which we have no
528: theoretical guarantees) yields the best results for NormalHedge (and in
529: fact outperforms Hedge in every simulation). It would be very interesting
530: to establish guarantees for NormalHedge for $c \to 1$.
531:
532: \begin{figure}
533: \begin{center}
534: \begin{tabular}{cc}
535: \includegraphics[width=0.43\textwidth]{plots/expt-0.20000-0.00100-reg.eps} &
536: \includegraphics[width=0.43\textwidth]{plots/expt-0.40000-0.00100-reg.eps} \\
537: $\gamma = 0.2, f = 0.001$ & $\gamma = 0.4, f = 0.001$ \\
538: \includegraphics[width=0.43\textwidth]{plots/expt-0.60000-0.00100-reg.eps} &
539: \includegraphics[width=0.43\textwidth]{plots/expt-0.80000-0.00100-reg.eps} \\
540: $\gamma = 0.6, f = 0.001$ & $\gamma = 0.8, f = 0.001$ \\
541: \includegraphics[width=0.43\textwidth]{plots/expt-0.20000-0.01000-reg.eps} &
542: \includegraphics[width=0.43\textwidth]{plots/expt-0.40000-0.01000-reg.eps} \\
543: $\gamma = 0.2, f = 0.01$ & $\gamma = 0.4, f = 0.01$ \\
544: \includegraphics[width=0.43\textwidth]{plots/expt-0.60000-0.01000-reg.eps} &
545: \includegraphics[width=0.43\textwidth]{plots/expt-0.80000-0.01000-reg.eps} \\
546: $\gamma = 0.6, f = 0.01$ & $\gamma = 0.8, f = 0.01$
547: \end{tabular}
548: \end{center}
549: \caption{Regrets to the best expert in the first simulation;
550: $\gamma \in \{ 0.2, 0.4, 0.6, 0.8 \}$ and $f \in \{ 0.001, 0.01 \}$.}
551: \label{fig:sim1-1}
552: \end{figure}
553:
554: \begin{figure}
555: \begin{center}
556: \begin{tabular}{cc}
557: \includegraphics[width=0.43\textwidth]{plots/expt-0.20000-0.10000-reg.eps} &
558: \includegraphics[width=0.43\textwidth]{plots/expt-0.40000-0.10000-reg.eps} \\
559: $\gamma = 0.2, f = 0.1$ & $\gamma = 0.4, f = 0.1$ \\
560: \includegraphics[width=0.43\textwidth]{plots/expt-0.60000-0.10000-reg.eps} &
561: \includegraphics[width=0.43\textwidth]{plots/expt-0.80000-0.10000-reg.eps} \\
562: $\gamma = 0.6, f = 0.1$ & $\gamma = 0.8, f = 0.1$ \\
563: \includegraphics[width=0.43\textwidth]{plots/expt-0.20000-0.50000-reg.eps} &
564: \includegraphics[width=0.43\textwidth]{plots/expt-0.40000-0.50000-reg.eps} \\
565: $\gamma = 0.2, f = 0.5$ & $\gamma = 0.4, f = 0.5$ \\
566: \includegraphics[width=0.43\textwidth]{plots/expt-0.60000-0.50000-reg.eps} &
567: \includegraphics[width=0.43\textwidth]{plots/expt-0.80000-0.50000-reg.eps} \\
568: $\gamma = 0.6, f = 0.5$ & $\gamma = 0.8, f = 0.5$
569: \end{tabular}
570: \end{center}
571: \caption{Regrets to the best expert in the first simulation;
572: $\gamma \in \{ 0.2, 0.4, 0.6, 0.8 \}$ and $f \in \{ 0.1, 0.5 \}$.}
573: \label{fig:sim1-2}
574: \end{figure}
575:
576: \subsubsection{The effect of tuning $\eta$ in Hedge}
577:
578: Next, to bring out the issue with parameter tuning in Hedge, we conducted a
579: simulation in which we fix the fraction of experts that are good, but vary
580: the total number of experts:
581: \begin{itemize}
582: \item The number of experts is $N$, and the discount parameter is
583: $\alpha = 0.001$. (We varied $N \in \{ 10, 100, 1000 \}$.)
584: \item The fraction of experts that are good is fixed at $f = 0.1$. The
585: notion of good and bad experts is the same as in the first simulation. (We
586: varied $\gamma \in \{ 0.2, 0.8 \}$.)
587: \item The remaining details are the same as in the first simulation.
588: \end{itemize}
Again, we tuned the learning rate parameter for Hedge to $\eta =
\sqrt{(\alpha - \alpha^2/2) \ln N}$, which now changes as we vary the
total number of experts, and we varied $c \in \{ 1, 2, 4 \}$ in
NormalHedge.
593:
594: The results (Figure~\ref{fig:sim2-1}) indicate that as $N$ decreases
595: (e.g.~$N = 100, 10$), the disparity between Hedge and NormalHedge
596: increases. We believe this is an issue with tuning the learning rate
597: $\eta$, which is conspicuously absent in NormalHedge, but we have not
598: precisely characterized the issue.
599:
600: \begin{figure}
601: \begin{center}
602: \begin{tabular}{cc}
603: \includegraphics[width=0.43\textwidth]{plots/expt2-0.20000-1000-reg.eps} &
604: \includegraphics[width=0.43\textwidth]{plots/expt2-0.80000-1000-reg.eps} \\
605: $\gamma = 0.2, N = 1000$ & $\gamma = 0.8, N = 1000$ \\
606: \includegraphics[width=0.43\textwidth]{plots/expt2-0.20000-100-reg.eps} &
607: \includegraphics[width=0.43\textwidth]{plots/expt2-0.80000-100-reg.eps} \\
608: $\gamma = 0.2, N = 100$ & $\gamma = 0.8, N = 100$ \\
609: \includegraphics[width=0.43\textwidth]{plots/expt2-0.20000-10-reg.eps} &
610: \includegraphics[width=0.43\textwidth]{plots/expt2-0.80000-10-reg.eps} \\
611: $\gamma = 0.2, N = 10$ & $\gamma = 0.8, N = 10$
612: \end{tabular}
613: \end{center}
614: \caption{Regrets to the best expert in the second simulation;
615: $\gamma \in \{ 0.2, 0.8 \}$ and $N \in \{ 1000, 100, 10 \}$.}
616: \label{fig:sim2-1}
617: \end{figure}
618:
619: %
620: %
621: %
622: %\begin{figure}
623: %\begin{center}
624: %\includegraphics[width=0.5\textwidth]{shifting-reg-1000.eps} \\
625: %\vskip 0.2in
626: %\includegraphics[width=0.5\textwidth]{shifting-reg-100.eps} \\
627: %\vskip 0.2in
628: %\includegraphics[width=0.5\textwidth]{shifting-reg-10.eps}
629: %\vskip 0.1in
630: %\end{center}
631: %\caption{Regrets in the second simulation.} \label{fig:sim2}
632: %\end{figure}
633: %
634: %
635: \section{Inferring latent random variables} \label{sec:latent}
636:
637: An important problem in statistical inference is to make predictions
638: or choose actions when the system under consideration has internal
states that cannot be observed directly. There are many manifestations
of this problem, including graphical models, Hidden Markov Models
(HMMs), Partially Observable Markov Decision Processes (POMDPs), and
Kalman filters. The common method for dealing with hidden states is to
model them as {\em latent random variables}. The relation between the
latent random variables and the observable random variables is modeled
using a joint probability distribution. Two very important
sub-problems arise in this approach: learning joint
distributions that involve latent random variables from examples that
contain only the state of the observable random variables, and using
joint distributions of this type to infer the value of some
variables given the state of others. At this time there is no good
universal solution to either of these sub-problems.
652:
653: We propose a different approach to the problem, where instead of
654: associating hidden states with hidden random variables, we associate
655: states with different experts. What we present here describes some
656: initial ideas. It is not an attempt to propose a solution to this
657: large and complex problem.
658:
659: Suppose that we are to predict a binary sequence $x_1,x_2,\ldots$,
660: $x_t \in \{0,1\}$ and suppose that we believe that the sequence can be
661: predicted reasonably well using a Hidden Markov Model. Specifically,
662: suppose there is a hidden state $S$ which attains one of the values
$1,\ldots,k$ at each time step. Suppose that the state transition is
Markovian and stationary, i.e.,
\[
P(S_t |S_{t-1},S_{t-2},\ldots) = P(S_t | S_{t-1}) = P(S_{t-1}|S_{t-2})
= \cdots.
\]
Assume in addition that the hidden state does not change very often,
i.e., $P(S_{t+1} = S_t)$ is close to 1. Finally, assume that the
distribution of the observable variable $X_t$ depends only on the
hidden state $S_t$ at the same time step.
673:
Consider the problem of predicting $X_{t+1}$ given $x_1,\ldots,x_t$
and the parameters of the HMM. Suppose that the prediction needs to
take the form of a distribution over the alphabet $\Sigma = \{0,1\}$.
So far this is exactly
the standard framework, but suppose we differ from the standard
framework by considering the $L_1$ loss $1-p_t(x_t)$, where $p_t(x_t)$
is the predicted probability assigned to the letter that actually
occurred at time $t$. This is instead of the standard log likelihood
loss $\log(1/p_t(x_t))$. While the log loss is easier to analyze, the
$L_1$ loss is often a more useful measure because the cumulative $L_1$
loss corresponds to the expected number of mistakes. While this loss
does not fit well in the maximum likelihood or Bayesian methodologies,
it fits NormalHedge very well, because the per-iteration loss is bounded.
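
To illustrate the difference between the two losses, consider two
hypothetical values of the probability assigned to the observed letter.
The numbers are chosen only for illustration, and $\log$ is taken to be the
natural logarithm:
\begin{align*}
p_t(x_t) = 0.9: & \quad 1 - p_t(x_t) = 0.1, & \log(1/p_t(x_t)) & \approx 0.105, \\
p_t(x_t) = 0.01: & \quad 1 - p_t(x_t) = 0.99, & \log(1/p_t(x_t)) & \approx 4.61.
\end{align*}
The $L_1$ loss always lies in $[0,1]$, whereas the log loss grows without
bound as $p_t(x_t) \to 0$.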
686:
Here is our proposal for solving the prediction problem using
NormalHedge. We associate a set of experts with each hidden state. The
experts are confidence rated, i.e., each expert outputs a
confidence level $0 \leq c \leq 1$ at each time step. The confidence
level is used in the confidence-rated variant of NormalHedge described
in the previous section. If expert $i$ corresponds to a hidden state
$j$, then $c_i$ should be large when $S=j$ and small when $S \neq
j$. If the parameters of the HMM are known, then we can
associate a single expert with each hidden state and compute the
prediction and the confidence value of that expert using Bayes'
formula.
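
One way to carry out this computation is the standard forward (filtering)
recursion for HMMs. The sketch below is an illustration under assumed
notation, not a specification of the method: the symbols
$A_{ij} = P(S_{t+1} = j \mid S_t = i)$,
$B_j(x) = P(X_t = x \mid S_t = j)$, and
$\pi_t(j) = P(S_t = j \mid x_1,\ldots,x_t)$ are introduced here for
convenience.
\begin{align*}
\tilde\pi_{t+1}(j) & = \sum_{i=1}^k \pi_t(i) A_{ij}, \\
\pi_{t+1}(j) & = \frac{\tilde\pi_{t+1}(j) B_j(x_{t+1})}
{\sum_{l=1}^k \tilde\pi_{t+1}(l) B_l(x_{t+1})}.
\end{align*}
Under this reading, the expert for state $j$ would predict $X_{t+1}$ with
the distribution $B_j(\cdot)$ and report the confidence
$c_j = \tilde\pi_{t+1}(j)$, which is the posterior probability of
$S_{t+1} = j$ given $x_1,\ldots,x_t$.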
698:
Now suppose that we do not know the parameter vector of the HMM but
that we know that the vector is one of $N$ possibilities. In this case
we associate $N$ experts with each hidden state and compute the
prediction and confidence value of each expert using Bayes' formula
for the corresponding parameter vector. The confidence value for each
state is the a posteriori probability of that state.
705:
706: In this case the NormalHedge algorithm will quickly converge and give
707: most of the weight to the experts that correspond to the correct
708: parameter vector. Moreover, if none of the parameter vectors is
709: a correct description of the sequence distribution, it will converge
710: on the vector which causes the least regret, i.e. makes the smallest
711: number of mistakes.
712:
Contrast this with the Bayesian approach. If the true distribution
generating the data is not included in the set of models over which we
take the posterior average, and if the loss function in which we are
interested is not log-likelihood but rather the number of mistakes, then
the cumulative loss of the Bayesian average can be much larger than
that of the best model in the set.
719:
720: \section{Open problems}
721:
722: The most interesting open problem is to close the gap between the
723: upper bound and lower bound on the parameter $c$. We have a lower
724: bound of $c>1$ and an upper bound of $c=4$. If we consider the case
725: $\alpha \to 0$ we can reduce $c$ to $2$. However, the gap between
726: $c=1$ and $c=2$ remains.
727:
728: One promising direction of expansion is to consider the game in the
729: continuous time limit directly. This leads us naturally into
730: stochastic processes in continuous time such as Wiener
731: processes. Understanding the performance of NormalHedge in this
732: context might yield new methods for stochastic estimation and
733: stochastic control.
734:
735: %%--------------------------------------------------------------------}
736: \bibliography{bib} \bibliographystyle{alpha}
737: %%--------------------------------------------------------------------}
738:
739:
740: \appendix
741:
742: \section{Proof of main theorem}
743:
Recall that the cumulative discounted regret of action $i$ at time
$t=j\alpha$, $j \in \N$, is defined recursively by
\[ R_i(0) = 0, \quad R_i(t+\alpha) = (1-\alpha) R_i(t) + g_i(t) - g_A(t),
\]
where $g_i(t) \in [-\sqrt{\alpha}, +\sqrt{\alpha}]$ is the (scaled) gain of
action $i$ at time $t$, and $g_A(t) \in [-\sqrt{\alpha}, +\sqrt{\alpha}]$
is the (scaled) gain of the hedger at time $t$. We define $r_i(t) = (g_i(t)
- g_A(t)) / \sqrt{\alpha} \in [-2,+2]$ as the (unscaled) instantaneous
regret to action $i$ at time $t$. The central quantity of interest is the
\emph{average potential}
\[ \Psi(t) = \frac1N \sum_{i=1}^N \Phi(R_i(t)). \]
Recall that we use the definition of the potential function $\Phi$
in Equation~\eqref{eqn:potential} with $c = 4$.
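
Unrolling the recursion gives an explicit form that may help in reading the
bounds below. It is a direct consequence of the definition, written out here
for convenience:
\[ R_i(j\alpha) = \sum_{k=0}^{j-1} (1-\alpha)^{j-1-k}
\bigl( g_i(k\alpha) - g_A(k\alpha) \bigr). \]
Since each gain difference has magnitude at most $2\sqrt{\alpha}$, it
follows for $0 < \alpha \leq 1$ that
$|R_i(j\alpha)| \leq 2\sqrt{\alpha} \sum_{k=0}^{j-1} (1-\alpha)^k \leq
2/\sqrt{\alpha}$.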
757:
758: \begin{claim}
759: There exists a positive constant $C \leq 2.32$ such that if $\alpha <
760: 1/(800\ln CN)$, then the average potential is always bounded from above by
761: $C$; that is, $\Psi(j\alpha) < C$ for any $j \in \N$.
762: \end{claim}
763: \begin{proof}
764: Fix $j \in \N$ and let $t = j\alpha$.
765:
We will analyze the change $\Psi(t+\alpha) - \Psi(t)$ in the average
potential by considering the averages over two separate groups:
\[ I_1 = \{ i : R_i(t) \leq 0 \} \quad \text{and} \quad I_2 = \{ i : R_i(t)
> 0 \}. \]

Let $\Psi_k(t) = (1/|I_k|) \sum_{i \in I_k} \Phi(R_i(t))$ be the average
potential for $I_k$, $k = 1, 2$ (assume without loss of generality that
neither $I_k$ is empty). We will show the following facts:
774: \begin{enumerate}
775: \item[(A):] $\Psi_1(t) = 1$ and $\Psi_1(t+\alpha) < 1 + (3/5)\alpha$;
776: \item[(B):] If $\Psi(t) < 2.32$, then $\Psi_2(t+\alpha) - \Psi_2(t) < (2/3)\alpha$;
777: \item[(C):] If $2.31 < \Psi(t) < 2.32$, then $\Psi(t+\alpha) < \Psi(t)$.
778: \end{enumerate}
779: These facts imply that the increase in average potential from $\Psi(t)$ to
780: $\Psi(t+\alpha)$ is always less than $(2/3)\alpha < 1/1200$, and that if
781: the average potential $\Psi(t)$ is strictly between $2.31$ and $2.32$,
782: then $\Psi(t+\alpha)$ is strictly less than $\Psi(t)$. The claim then
783: follows by induction because $\Psi(0) = 1$.
784:
785: We now prove the facts (A), (B), and (C).
786:
787: (A): For $i \in I_1$, $\Phi(R_i(t)) = 1$ and $R_i(t+\alpha) \leq (1-\alpha)
788: R_i(t) + |r_i(t)|\sqrt{\alpha} \leq 2\sqrt{\alpha}$. Since $\Phi(x)$ is
789: non-decreasing in $x$, we have $\Phi(R_i(t+\alpha)) \leq
790: \Phi(2\sqrt{\alpha}) = e^{\alpha/2} < 1+\alpha/2+\alpha^2e^{\alpha/2}/2 <
791: 1+(3/5)\alpha$ (the last inequality follows from the upper bound on
792: $\alpha$).
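
The two inequalities at the end of (A) can be spelled out as follows. This
is a routine check that uses only $\alpha < 1/800$, which the condition on
$\alpha$ implies when $N \geq 2$. By Taylor's theorem,
\[ e^{\alpha/2} = 1 + \frac{\alpha}{2} + \frac{\alpha^2}{8} e^{\xi}
\quad \text{for some } \xi \in (0, \alpha/2), \]
so $e^{\alpha/2} < 1 + \alpha/2 + \alpha^2 e^{\alpha/2}/2$. Moreover,
$\alpha e^{\alpha/2}/2 < e^{1/1600}/1600 < 1/10$, and therefore
\[ 1 + \frac{\alpha}{2} + \alpha \cdot \frac{\alpha e^{\alpha/2}}{2}
< 1 + \frac{\alpha}{2} + \frac{\alpha}{10} = 1 + \frac35 \alpha. \]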
793:
794: (B): We address terms in $I_2$ by expanding $\Phi(R_i(t+\alpha))$ around
795: the point $R_i(t) \neq 0$ via Taylor's theorem:
796: \[ \Phi(R_i(t+\alpha)) = \Phi(R_i(t)) + d_i(t) \Phi'(R_i(t)) + \frac12
797: d_i(t)^2 \Phi''(\rho_i) \]
798: where $d_i(t) = r_i(t) \sqrt{\alpha} - \alpha R_i(t)$ and $\rho_i \in \R$
799: lies between $R_i(t)$ and $R_i(t+\alpha)$. Because the hedger's weights are
800: chosen so that $p_i(t) \propto \Phi'(R_i(t))$, we have that
801: \[ \sum_{i=1}^N g_i(t) \Phi'(R_i(t)) - g_A(t) \sum_{i=1}^N \Phi'(R_i(t)) =
802: 0 \]
803: and thus
804: \[ \Phi(R_i(t+\alpha)) - \Phi(R_i(t)) = -\alpha R_i(t) \Phi'(R_i(t)) +
805: \frac12 d_i(t)^2 \Phi''(\rho_i). \]
We need a few bounds before proceeding. First, if $\Psi(t) < 2.32$, then
$\Phi(R_i(t)) < 2.32N$ for all $i$, which implies $R_i(t) <
\sqrt{8\ln(2.32N)}$ for all $i$. By the condition on $\alpha$, we also have
$\sqrt{\alpha}R_i(t) < 1/10$. Next, since $\Phi''$ is evaluated at $\rho_i$
and is non-decreasing, we bound $\rho_i^2$ from above:
811: \[ \rho_i^2 \ \leq \ \max\{R_i(t), R_i(t+\alpha)\}^2 \ \leq
812: \ (R_i(t) + |r_i(t)|\sqrt{\alpha})^2
813: \ = \ R_i(t)^2 + 2\sqrt{\alpha}R_i(t)|r_i(t)| + \alpha r_i(t)^2
814: \ \leq \ R_i(t)^2 + \frac12. \]
Finally, we bound $d_i(t)^2$ as follows:
\[ d_i(t)^2 \ \leq \ (|r_i(t)|\sqrt{\alpha} + \alpha R_i(t))^2 \ \leq
\ (2+1/10)^2\alpha \ = \ \frac{441}{100} \alpha. \]
Altogether, we have
\begin{align*}
\Phi(R_i(t+\alpha)) - \Phi(R_i(t))
& = - \alpha R_i(t) \Phi'(R_i(t)) + \frac12 d_i(t)^2 \Phi''(\rho_i) \\
& = - \alpha R_i(t) \frac{R_i(t)}{4} e^{R_i(t)^2/8} + \frac12 d_i(t)^2
\left( \frac14 + \frac{\rho_i^2}{16} \right) e^{\rho_i^2/8} \\
& \leq - \alpha \frac{R_i(t)^2}{4} e^{R_i(t)^2/8} + \frac{441\alpha}{200}
\left( \frac9{32} + \frac{R_i(t)^2}{16} \right) e^{R_i(t)^2/8} e^{1/16} \\
& \leq \alpha \left(\frac23 - \frac1{10} R_i(t)^2\right) e^{R_i(t)^2/8}.
\end{align*}
The last step uses $\frac{441}{200} \cdot \frac{9}{32} \, e^{1/16} < \frac23$
and $\frac{441}{200} \cdot \frac{1}{16} \, e^{1/16} < \frac{3}{20}$.
828: The final bound is decreasing as a function of $R_i(t) \geq 0$. This
829: implies $\Phi(R_i(t+\alpha)) - \Phi(R_i(t)) \leq (2/3)\alpha$, so
830: $\Psi_2(t+\alpha) - \Psi_2(t) < (2/3)\alpha$.
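
As a check on the auxiliary bounds used in (B), the arithmetic can be
written out as follows, taking $C = 2.32$ in the condition on $\alpha$ (an
assumption made here for concreteness) and $N \geq 2$, so that
$\alpha < 1/800$:
\begin{align*}
\alpha R_i(t)^2 & < \frac{8\ln(2.32N)}{800\ln(2.32N)} = \frac1{100},
\quad \text{so} \quad \sqrt{\alpha} R_i(t) < \frac1{10}, \\
2\sqrt{\alpha}R_i(t)|r_i(t)| + \alpha r_i(t)^2
& \leq 2 \cdot \frac1{10} \cdot 2 + \frac{4}{800} = 0.405 \leq \frac12.
\end{align*}
The monotonicity of the final bound can be seen by writing $u = R_i(t)^2$:
\[ \frac{d}{du} \left[ \left( \frac23 - \frac{u}{10} \right) e^{u/8} \right]
= - \left( \frac1{60} + \frac{u}{80} \right) e^{u/8} < 0
\quad \text{for } u \geq 0. \]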
831:
832: (C): First, consider the problem of maximizing
833: \[ f(x_1, \ldots, x_n) = \sum_{i=1}^n \left( \frac23 - \frac{x_i^2}{10}
834: \right) e^{x_i^2/8} \]
835: subject to the constraint $(1/n) \sum_{i=1}^n e^{x_i^2/8} \geq B$ for some
836: $B \geq 1$. Simple variational arguments imply that the maximum is attained
837: when $x_i = \sqrt{8\ln B}$ for all $i$. Therefore, following the argument
838: for (B), we have that if $\Psi_2(t) \geq B$ for some $B \geq 1$, then
839: \[ \Psi_2(t+\alpha) - \Psi_2(t) \leq \alpha \cdot B \cdot \left( \frac23 -
840: \frac45 \ln B \right). \]
841:
842: Let $p_1 = |I_1|/N$ and $p_2 = 1 - p_1$. Suppose
843: $\Psi(t) > 2.31$. Because $\Psi_1(t) = 1$, we have
844: \[ \Psi_2(t)
845: = \frac1{p_2} \left( \Psi(t) - p_1 \right) \geq \frac1{p_2} (2.31 - p_1)
846: \doteq B. \]
847:
848: Now we analyze the overall change in average potential. By (A), the
849: increase in average potential over $i \in I_1$ is less than
850: $(3/5)\alpha$. Then
851: \begin{align*}
852: \frac{\Psi(t+\alpha) - \Psi(t)}{\alpha}
853: & < p_1 \cdot \frac35 + p_2 \cdot B \cdot \left( \frac23 - \frac45 \ln B
854: \right) \\
855: & = p_1 \cdot \frac35 + (2.31 - p_1) \cdot \left( \frac23 - \frac45 \ln
856: \frac1{1-p_1} - \frac45 \ln (2.31 - p_1) \right).
857: \end{align*}
858:
859: The final RHS is decreasing as a function of $p_1 \geq 0$, so it is
860: maximized when $p_1 = 0$. Making this substitution, the RHS is negative,
861: and thus $\Psi(t+\alpha) < \Psi(t)$.
862: \end{proof}
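
As a numerical sanity check on the last step of (C) (rounded values, with
$\ln$ the natural logarithm), substituting $p_1 = 0$ gives
\[ 2.31 \cdot \left( \frac23 - \frac45 \ln 2.31 \right)
\approx 2.31 \cdot (0.6667 - 0.6698) \approx -0.0072 < 0. \]
More generally, the expression $B \cdot (2/3 - (4/5) \ln B)$ changes sign at
$B = e^{5/6} \approx 2.301$, which is consistent with the threshold $2.31$
used in (C).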
863:
864:
865:
866:
867: %%\proof
868: %We define the {\em average potential}
869: %\[
870: %\Psi(t) \doteq {1 \over N} \sum_{i=1}^N \Phi \paren{\ctR{i}{t}}
871: %\]
872: %And study the time evolution of $\Psi(t)$.
873: %
874: %Clearly, $\Psi(0)=1$.
875: %
876: %We fix $\alpha$ and study the change in the average potential from
877: %$t=j\alpha$ to $t+\alpha = (j+1)\alpha$
878: %\begin{eqnarray}
879: %\lefteqn{\Psi(t+\alpha) - \Psi(t)} && \label{eqn:proof-1}\\
880: %&=&{1 \over N} \sum_i
881: %\Phi \paren{(1-\alpha)\ctR{i}{t} + \ctg{i}{t}-\ctg{A}{t} }
882: %- \Phi \paren{ \ctR{i}{t} } \label{eqn:proof-2}
883: %\end{eqnarray}
884: %
885: %We use $\ctd{i}{t}$ to denote $\ctR{i}{t+\alpha} - \ctR{i}{t}$ and
886: %$\ctr{i}{t}$ to denote $(\ctg{i}{t}-\ctg{A}{t})/\sqrt{\alpha}$. It follows directly from the
887: %definitions that $\ctd{i}{t} = \ctr{i}{t} \sqrt{\alpha} - \ctR{i}{t} \alpha$
888: %and that $|\ctr{i}{t}| \leq 2$.
889: %
890: %We perform a Taylor expansion of Equation~\eqref{eqn:proof-2}:
891: %\begin{equation} \label{eqn:proof-3}
892: %\sum_i \Phi \paren{\ctR{i}{t+\alpha}} - \Phi \paren{\ctR{i}{t}}
893: %=\sum_i \ctd{i}{t} \Phi'(\ctR{i}{t}) + \half \ctd{i}{t}^2 \Phi''(\ctR{i}{t'})
894: %\end{equation}
895: %For some $t'$, $t \leq t' \leq t+\alpha$.
896: %
897: %We start by analyzing the first term
898: %\[
899: %\sum_i \ctd{i}{t} \Phi'(\ctR{i}{t}) =
900: %\sum_i \paren{\ctg{i}{t} - \ctg{A}{t} - \alpha \ctR{i}{t}} \Phi'(\ctR{i}{t})
901: %\]
902: %It follows from the definition of the NH algorithm that
903: %\[
904: %\ctg{A}{t} = \frac{\sum_i \ctg{i}{t} \Phi'(\ctR{i}{t})}{\sum_i \Phi'(\ctR{i}{t})}
905: %\]
906: %From which it follows that:
907: %\[
908: %\sum_i \ctd{i}{t} \Phi'(\ctR{i}{t}) = - \alpha \sum_i \ctR{i}{t} \Phi'(\ctR{i}{t})
909: %\]
910: %
911: %As $\Phi''(x)$ is an increasing function of $x$ we can bound rewrite
912: %Equation~\eqref{eqn:proof-3} as follows:
913: %\begin{eqnarray} \label{eqn:proof-4}
914: %\lefteqn{\Psi(t+\alpha) - \Psi(t)}
915: %&& \\
916: %&\leq&
917: %{1 \over N}
918: %\sum_i \half \paren{\ctr{i}{t}\sqrt{\alpha} - \ctR{i}{t} \alpha}^2 \Phi''(\ctR{i}{t}+|\ctr{i}{t}|\sqrt{\alpha})
919: % - \alpha \ctR{i}{t} \Phi'(\ctR{i}{t}) \nonumber
920: %\\
921: %&=&
922: %{\alpha \over N}
923: %\paren{\sum_i \half \paren{\ctr{i}{t} - \ctR{i}{t} \sqrt{\alpha}}^2 \Phi''(\ctR{i}{t}+|\ctr{i}{t}|\sqrt{\alpha})
924: % - \ctR{i}{t} \Phi'(\ctR{i}{t}) } \label{eqn:proof-5}
925: %\end{eqnarray}
926: %
927: %We are now ready to use the assumption that $\alpha$ is small. We do
928: %this in two steps. First, we assume $\alpha \to 0$, which makes for a
929: %simpler argument. Then we come back and show that requiring
930: %$\alpha < {1 \over 800 \ln 2N}$ is sufficient.
931: %
932: %
933: %{\bf Infinitesimally small $\alpha$}\\
934: %We divide both sides of Equation~\eqref{eqn:proof-5} by $\alpha$ and take the limit when $\alpha \to 0$ to get
935: %\begin{eqnarray}
936: %{d \over dt} \Psi(t) &\leq&
937: %{1 \over N}
938: %\paren{\sum_i \half \ctr{i}{t}^2 \Phi''(\ctR{i}{t})
939: % - \ctR{i}{t} \Phi'(\ctR{i}{t}) } \label{eqn:proof-6}
940: %\\
941: %&=&
942: %{1 \over N}
943: %\paren{\sum_{i; \ctR{i}{t}\geq 0} \half \ctr{i}{t}^2
944: % \paren{{1 \over c} +{\ctR{i}{t}^2 \over c^2}}
945: % \exp \paren{\ctR{i}{t}^2 \over 2c}
946: % - {\ctR{i}{t}^2 \over c}
947: % \exp \paren{\ctR{i}{t}^2 \over 2c}
948: %}
949: %\nonumber
950: %\\
951: %&\leq&
952: %{1 \over N}
953: %\paren{\sum_{i; \ctR{i}{t}\geq 0} 2
954: % \paren{{1 \over c} +{\ctR{i}{t}^2 \over c^2}}
955: % \exp \paren{\ctR{i}{t}^2 \over 2c}
956: % - {\ctR{i}{t}^2 \over c}
957: % \exp \paren{\ctR{i}{t}^2 \over 2c}
958: %}
959: %\\
960: %& \leq &
961: %{1 \over cN}
962: %\paren{\sum_{i; \ctR{i}{t}\geq 0}
963: % 2 \exp \paren{\ctR{i}{t}^2 \over 2c}
964: % + \paren{{2 \over c} - 1}
965: % \ctR{i}{t}^2
966: % \exp \paren{\ctR{i}{t}^2 \over 2c}
967: %}
968: %\\
969: %& \leq &
970: %{1 \over c}
971: %\paren{2 \Psi(t)
972: % + {1 \over cN}\paren{{2 \over c} - 1}
973: % \sum_i \ctR{i}{t}^2
974: % \exp \paren{\ctR{i}{t}^2 \over 2c}
975: %}
976: %\end{eqnarray}
977: %plugging in the choice $c=4$ we get that
978: %\begin{equation}
979: %{d \over dt} \Psi(t) \leq
980: %{1 \over c}
981: %\paren{2 \Psi(t)
982: % - {1 \over 2N}
983: % \sum_i \ctR{i}{t}^2
984: % \exp \paren{\ctR{i}{t}^2 \over 2c}
985: %}
986: %\end{equation}
987: %We now find a condition under which the difference on
988: %the RHS is negative. Assuming
989: %$\Psi(t)=(1/N)\sum_i \exp \paren{\ctR{i}{t}^2 \over 2c}=A$, it is easy to verify that
990: %$(1/N)\sum_i \ctR{i}{t}^2 \exp \paren{\ctR{i}{t}^2 \over 2c}$ is
991: %minimized when all of the regrets are equal, which implies that
992: %\[
993: %\ctR{i}{t} = \ctR{}{t} = \sqrt{2c\ln(A)} = \sqrt{8\ln(A)}
994: %\]
995: %If $A \geq 2$ then
996: %\[
997: %{1 \over 2N} \sum_i \ctR{i}{t}^2 \exp \paren{\ctR{i}{t}^2 \over 2c}
998: %\geq
999: %{1 \over 2} \ctR{}{t}^2 A
1000: %=
1001: %{1 \over 2} 8 \ln(A) A \geq 8 \ln(2) A > 2A
1002: %\]
1003: %Thus if $\Psi(t) \geq 2$, ${d \over dt} \Psi(t) <0$ and as
1004: %$\Psi(0)=1$ and $\Psi(t)$ is a continuous and differentiable function
1005: %of $t$, $\Psi(t) < 2$ for all $t$, which completes the proof for
1006: %$\alpha \to \infty$.
1007: %
1008: %{\bf small but finite $\alpha$}\\
1009: %We now show that requiring $\alpha < {1 \over 800 \ln 4N}$ is
1010: %sufficient. As $\alpha>0$ is fixed we can use induction over the
1011: %iterations of the game. We divide the range $[0,4]$ into two
1012: %sub-ranges: $[0,3]$ and $(3,4)$. We show that, for every iteration $j$
1013: %\begin{enumerate}
1014: %\item If $\Psi(j \alpha) \leq 3$ then $\Psi((j+1)\alpha) <4$.
1015: %\item If $3 < \Psi(j \alpha) < 4$ then $\Psi((j+1) \alpha) < \Psi(j
1016: % \alpha)$.
1017: %\end{enumerate}
1018: %Using induction over these two conditional statements we get that
1019: %$\Psi(j) \leq 4$ for all $j$.
1020: %
1021: %We will now prove the two conditional statements.
1022: %
1023: %We first prove an upper bound on $\ctR{i}{t}$ conditioned on the
1024: %assumption that $\Psi(t)<4$. As the potential is non-negative we get that
1025: %\[
1026: %\forall i,\;\; \Phi \paren{\ctR{i}{j\alpha}} \leq 4N
1027: %\]
1028: %From which it follows that
1029: %\[
1030: %\forall i,\;\; \ctR{i}{j \alpha} \leq \sqrt{8 \ln 4N}
1031: %\]
1032: %\newcommand{\Rm}{R_m}
1033: %We use $\Rm$ to denote $\sqrt{8 \ln 4N}$ and note that requiring
1034: %$\alpha < {1 \over 800 \ln 4N}$ implies that $\sqrt{\alpha} \ctR{i}{t}
1035: %< 1/10$.
1036: %
1037: %We now expand Equation~\eqref{eqn:proof-4} without taking the limit $\alpha \to 0$.
1038: %\begin{eqnarray}
1039: %\lefteqn{\Psi(t+\alpha) - \Psi(t)}
1040: %&& \\
1041: %&\leq&
1042: %{\alpha \over N}
1043: %\sum_i \paren{ \half \paren{\ctr{i}{t} - \ctR{i}{t} \sqrt{\alpha}}^2 \Phi''(\ctR{i}{t}+|\ctr{i}{t}|\sqrt{\alpha})
1044: % - \ctR{i}{t} \Phi'(\ctR{i}{t}) } \\
1045: %&\leq&
1046: %{\alpha \over N} \left[
1047: %\sum_{i; \ctR{i}{t}\geq -2\sqrt{\alpha}} \paren{
1048: % \half \paren{\ctr{i}{t} + \sqrt{\alpha}\ctR{i}{t}}^2
1049: % \paren{{1 \over c} +
1050: % {\paren{\ctR{i}{t} + \sqrt{\alpha}|\ctr{i}{t}|}^2 \over c^2}}
1051: % \exp\paren{\paren{\ctR{i}{t} + \sqrt{\alpha}|\ctr{i}{t}|}^2\over 2c}
1052: % } \right. \\
1053: %&& \left.
1054: %-\sum_{i; \ctR{i}{t}\geq 0} \paren{
1055: % {\ctR{i}{t}^2 \over c} \exp \paren{\ctR{i}{t}^2 \over 2c}
1056: % }
1057: %\right]
1058: %\nonumber
1059: %\end{eqnarray}
1060: %We first consider the terms in the first sum which have no matching
1061: %terms in the second sum. In other words, indices $i$ for which
1062: %$-2\sqrt{\alpha} \leq \ctR{i}{t} <0$. It is easy to show that because
1063: %$c=4$, $\sqrt{\alpha} \ctR{i}{t}< 1/10$, $|\ctr{i}{t}|<2$, these terms
1064: %are smaller than $1$. As for these terms $\Phi(\ctR{i}{t}=0$, the
1065: %result is that $\Phi(\ctR{i}{t+\alpha})<1$. Thus these terms cannot
1066: %increase the overall average potential at $t+\alpha$ beyond $1$ and so
1067: %we can ignore them.
1068: %
1069: %We thus consider only terms for which $\ctR{i}{t}\geq 0$ so that there
1070: %is a term in both sums. Using the facts that $c=4$, $\sqrt{\alpha}
1071: %\ctR{i}{t}< 1/10$, $|\ctr{i}{t}|<2$ and $\alpha < 1/800$ we can show
1072: %that
1073: %\begin{eqnarray}
1074: %\lefteqn{
1075: %\sum_{i; \ctR{i}{t}\geq 0}
1076: % \half \paren{\ctr{i}{t} + \sqrt{\alpha}\ctR{i}{t}}^2
1077: % \paren{{1 \over c} +
1078: % {\paren{\ctR{i}{t} + \sqrt{\alpha}|\ctr{i}{t}|}^2 \over c^2}}
1079: % \exp\paren{\paren{\ctR{i}{t} + \sqrt{\alpha}|\ctr{i}{t}|}^2\over 2c}
1080: %- {\ctR{i}{t}^2 \over c} \exp \paren{\ctR{i}{t}^2 \over 2c}
1081: %} && \nonumber \\
1082: %& \leq &
1083: %\sum_{i; \ctR{i}{t}\geq 0}
1084: % \half (2.1)^2
1085: %\paren{{1 \over 4}+{\paren{\ctR{i}{t}+2\sqrt{\alpha}}^2 \over 16}}
1086: %\exp \paren{\paren{\ctR{i}{t}+2\sqrt{\alpha}}^2 \over 8}
1087: %-{\ctR{i}{t}^2 \over 4} \exp \paren{\ctR{i}{t}^2 \over 8}
1088: %\nonumber \\
1089: %& \leq &
1090: %\sum_{i; \ctR{i}{t}\geq 0}
1091: %\brackets{
1092: % \half (2.1)^2
1093: %\paren{{1 \over 4}+{\paren{\ctR{i}{t}+2\sqrt{\alpha}}^2 \over 16}}
1094: %\exp \paren{\ctR{i}{t}\sqrt{\alpha} + \alpha \over 2}
1095: %-{\ctR{i}{t}^2 \over 4}
1096: %}
1097: %\exp \paren{\ctR{i}{t}^2 \over 8}
1098: %\nonumber \\
1099: %& \leq &
1100: %\sum_{i; \ctR{i}{t}\geq 0}
1101: %\brackets{0.6 - 0.1 \ctR{i}{t}^2}
1102: %\exp \paren{\ctR{i}{t}^2 \over 8}
1103: %\nonumber
1104: %\end{eqnarray}
1105: %For a fixed value of $\Psi(t)$, this sum is maximized when
1106: %$\ctR{i}{t}$ are all equal to $\sqrt{\ln \Psi{t}}$. Claim 1. above
1107: %follows from the fact that the increase in the average potential in a
1108: %single step is at most $0.6$. Claim 2. follows from the fact that if
1109: %$\Psi{t} \geq 3$ then $\sqrt{\ln \Psi{t}} \geq 2.9$ and setting
1110: %$\ctR{i}{t}=2.9$ for all $i$ we find that $\Psi(t+\alpha) < \Psi(t)$.
1111: %
1112: %%\qed
1113: %
1114: \end{document}
1115:
1116: